Abstract


A  second-generation  fully  asynchronous  Fast  Fourier  Transform  (FFT)  processor 
for  space  applications  is  developed  in  this  thesis.  A  high-performance  patented  FFT 
architecture  invented  by  Suter  and  Stevens  was  used  as  the  basis  for  a  16-point  FFT 
(FFT-16) processor design. A brief derivation of the architecture, the asynchronous
design  methodologies  used  and  space-based  integrated  circuit  issues  are  presented.  The 
Synopsys  VLSI  CAD  system  and  a  radiation  tolerant  design  library  developed  by  the  Air 
Force  Research  Laboratory  were  used  to  implement  the  design.  A  critical  building  block 
of the FFT-16, the FFT-4, was fabricated as a cost-effective method to validate the cell
library  and  the  applied  asynchronous  design  methodologies  before  larger  point  sizes  are 
fabricated. 

Results from high-fidelity simulations show that the FFT-16 design has an
efficiency of 28 nJ/Unit-Transform and has a worst case throughput of 760 ns.
Extrapolating these results to an FFT-1024 gives an estimated efficiency of
120 nJ/Unit-Transform and worst case throughput of 2 µs. These results demonstrate
that current space-based FFT processors can be replaced with a design that improves
performance and efficiency by two orders of magnitude.


AN  IMPROVED  ASYNCHRONOUS  IMPLEMENTATION 
OF  A  FAST  FOURIER  TRANSFORM  ARCHITECTURE 
FOR  SPACE  APPLICATIONS 


1.  Introduction 


1.1  Introduction 

The  goal  of  this  research  is  to  investigate,  design  and  implement  an  asynchronous 
very  large  scale  integrated  (VLSI)  circuit  targeted  for  space  applications  that  calculates  a 
Fast  Fourier  Transform  (FFT)  using  the  Suter  and  Stevens  architecture  [1].  The  research 
presented  in  this  thesis  is  the  continuation  of  work  performed  in  a  previous  thesis  [2].  An 
improved design of a 16-point FFT (FFT-16), from initial concepts to the simulation test
results, is presented. A subset of the design (an FFT-4) was fabricated using the
Hewlett-Packard 0.5 µm commercial process for future analysis.


1.2  Problem  Statement 

There  is  an  identified  need  for  a  fast,  low  power  FFT  processor  for  space 
applications.  Theoretically,  the  implementation  of  the  Suter  and  Stevens  architecture  in 
an  asynchronous  fashion  should  result  in  an  extremely  fast,  low  power  FFT  processor. 
Such a design can also be used in the space environment by replacing the standard VLSI
cells  with  radiation  tolerant  cells. 

The  Suter  and  Stevens  architecture  inherently  lends  itself  to  an  asynchronous 
implementation.  However,  current  VLSI  design  tools  are  not  capable  of  asynchronous 
circuit  synthesis.  Consequently,  a  large  portion  of  this  research  covers  the  asynchronous 
design  methodology.  Asynchronous  design  implies  that  the  global  clocking  strategy  used 
in  synchronous  design  is  removed  and  replaced  with  a  self-timing  scheme,  which  lowers 
the  energy  requirement  of  the  circuit  [3].  The  FASST  (Fully  Asynchronous  Suter 
Stevens  Transform)  acronym  used  throughout  this  thesis  refers  to  the  asynchronous 
implementation  of  the  Suter  and  Stevens  architecture. 

The  space  application  requirement  is  met  by  using  a  VLSI  design  library, 
developed  jointly  by  the  Air  Force  Research  Laboratory  (AFRL)  and  Mission  Research 
Corporation  (MRC)  [4].  This  unique  library  enables  a  circuit  design  for  space  application 
to be fabricated in a commercial foundry. The tradeoff for using this library is that both
the overall die area and the energy requirement of the design are increased.


1.3  Methodology 

The  first  step  in  a  VLSI  design  is  to  define  the  top-level  function  of  the  circuit. 
Then, initial design constraints are chosen, including the CMOS technology and the data
word  format.  Goals  are  established  for  area,  power  and  performance.  The  architecture  of 
the  design  is  then  selected.  The  next  step  is  to  break  down  the  architecture  into 
manageable blocks, with well-defined interfaces and specifications. Each block is
developed from the behavioral level to the physical layout and simulated at each level for
verification.  When  all  blocks  are  complete,  they  are  tested  together  to  verify  top-level 
operation.  Circuit  extractions  from  the  physical  layout  are  used  to  test  the  circuit  in  a  high 
fidelity  simulation  in  order  to  evaluate  the  function,  performance  and  efficiency  of  the 
design.  The  design  and  verification  process  is  repeated  until  the  established  design  goals 
are  met.  Once  the  design  is  fabricated,  it  can  be  tested  and  compared  to  the  simulation 
results  [5]. 

1.4  Overview 

This  thesis  is  organized  into  six  chapters.  The  chapters  are  organized  in  a  logical 
manner beginning with the problem statement and concluding with results. Chapter One
introduces the problem and the design methodology used, and includes an overview of the
thesis.

Chapter  Two  provides  an  overview  of  asynchronous  circuit  design  methodologies, 
FFT  theory  and  the  radiation  hardening  of  electronics.  The  chapter  concludes  with  a 
presentation  of  other  work  related  to  this  research  area.  The  chapter  highlights  the 
significance  of  the  problem  and  gives  the  motivation  for  this  thesis. 

Chapter  Three  presents  an  overview  of  the  FFT-16  design.  The  theory  presented 
in  Chapter  Two  is  applied  to  the  development  of  a  working  solution.  The  function  of  the 
top-level  design  and  the  major  components  are  discussed. 

Chapter Four is a presentation of the design implementation of each functional
block. Each block introduced in Chapter Three is revisited in detail. The final design of
each block, as well as other possible designs, is presented.

Chapter  Five  is  a  presentation  and  analysis  of  the  simulation  results.  Results  are 
given  at  each  level  of  design  for  the  individual  components  as  well  as  the  results  of  the 
top-level  design. 

Chapter  Six  concludes  the  thesis  by  comparing  the  results  of  this  research  with  the 
research  efforts  presented  in  Chapter  Two.  Lessons  learned  during  the  design  process  are 
presented.  Recommendations  for  future  work  in  this  area  are  also  given. 


2. Literature Review


2.1  Introduction 

The purpose of this chapter is to present applicable research in the subject areas of
asynchronous design, FFT theory, and radiation hardened electronics. A literature search
failed to identify any design or research that combines these areas, other than a
previous student's thesis on the same topic [2][6]. There are a few examples of designs
that overlap two of these areas [7]. The chapter concludes with a presentation of other
FFT processors that are comparable to this effort. A table is presented in Section 2.5
summarizing the performance of these designs.


2.2  Asynchronous  Design 

Asynchronous  circuit  design  is  not  a  new  concept  in  electronic  circuit  design.  In 
fact, asynchronous circuits have been used since the 1950s but have not been widely
adopted by modern industry [8]. Asynchronous circuits have the potential to out-perform
synchronous  circuits.  However,  the  tools  and  support  community  are  still  not  as  mature 
as  those  used  in  the  mainstream  development  of  synchronous  circuits.  The  following 
sections  contrast  the  synchronous  and  asynchronous  design  methodologies  [3]  [9]. 

2.2.1 Asynchronous Versus Synchronous Design

Synchronous  design  implies  that  a  global  clock  is  used  to  synchronize  the 
exchange of data between components in a design. Logic blocks are typically surrounded
by  latches,  which  save  the  state  of  the  block  during  each  clock  cycle.  The  clock  rate  is 
defined  by  the  critical  path  through  the  system  [5]. 

Asynchronous  design  removes  the  global  clock  and  replaces  it  with  a  self-timed 
protocol. Interconnected blocks communicate and exchange data with a sequence of
handshakes  [3]. 

2.2.2  Synchronous  Design  Flow 

Traditional  synchronous  VLSI  circuits  are  designed  using  modern  synthesis  tools. 
A  hardware  description  language  (HDL)  is  used  to  describe  the  behavior  of  the  circuit. 
The  Very  High  Speed  Integrated  Circuit  Hardware  Description  Language  (VHDL)  is  the 
DoD  standard  and  consequently  was  used  in  this  effort  [10]. 

Typically,  a  high  level  behavioral  description  of  the  circuit  under  development  is 
written  in  VHDL.  The  behavioral  VHDL  is  then  translated  into  structural  VHDL  with  a 
tool  such  as  the  Synopsys  Design  Analyzer  [11].  This  essentially  translates  the  design 
from  a  logical  algorithmic  behavior  to  a  realizable  gate  level  structural  design.  Once  the 
structural  design  is  verified,  a  layout  netlist,  which  describes  the  component  connections, 
is  generated  from  a  tool  such  as  the  Synopsys  Graphical  Environment  [12].  The  circuit 
netlist  is  then  used  with  a  standard  cell  library  to  produce  a  final  layout  with  a  place  and 
route  tool  such  as  the  Lager  Octtools  [13].  Figure  2-1  summarizes  the  typical 
synchronous  design  flow. 

Figure 2-1. Synchronous Design Flow


2.2.3  Asynchronous  Design  Flow 

While  an  established  design  flow  exists  for  synchronous  circuits,  the  same  is  not 
true  for  asynchronous  design.  No  suite  of  design  tools  exists  that  allows  an  easy  flow 
from  behavioral  VHDL  to  an  automated  layout.  Similar  to  synchronous  design,  the 
asynchronous  design  process  begins  with  a  behavioral  VHDL  description.  However,  the 
automated  process  ends  here  and  is  replaced  with  a  combination  of  manual  design  and 
partial automatic synthesis to arrive at a structural VHDL description. The design flow
continues as normal once the structural VHDL is validated. The methods used in the
design phase between behavioral and structural VHDL are described in this section.

The components designed in this effort fell into several design categories. The
fundamental  mode  bounded  delay  methodology  is  used  for  blocks  with  relatively  fixed 
completion  times.  The  delay  insensitive  design  methodology  applies  to  functional  blocks 
with  widely  varying  completion  times.  Burst  mode  design  methodology  applies  to 
components  that  serve  as  controllers  or  asynchronous  finite  state  machines  (AFSMs). 
Finally,  the  speed  independent  model  specifies  the  handshaking  protocols  between  major 
functional  blocks.  These  issues  are  presented  in  the  following  sections  [3]. 

2.2.3.1 Fundamental Mode Bounded Delay Methodology

The  fundamental  mode  bounded  delay  methodology  was  used  for  functional 
blocks  that  had  little  variation  in  completion  time  [14].  This  methodology  assumes  that 
the delay time through a functional block is known and constant. Worst-case delays are
used  similar  to  a  clocked  circuit.  The  best  example  of  this  type  of  functional  block  is  a 
latch.  The  easiest  way  to  determine  when  data  is  fully  latched  is  through  a  delay  element. 
The  delay  element  should  have  a  slightly  greater  delay  than  the  completion  time  of  the 
logic  block  [3]. 

Difficulty  arises  in  synthesizing  this  structure  since  timing  information  cannot  be 
synthesized  from  behavioral  VHDL.  The  delay  element  must  be  described  at  the 
structural  level.  The  length  of  the  delay  element  is  determined  through  a  layout-level 
simulation  of  the  logic  block.  The  results  of  the  simulation  are  then  back  annotated  into 
the  structural  VHDL  design.  Figure  2-2  shows  a  delay  element  used  to  model  the  latch 
completion  time.  An  acknowledge  (ACK)  signal  is  asserted  when  the  data  is  latched 
after  the  request  (REQ)  is  generated. 


Figure  2-2.  Fundamental  Mode  Bounded  Delay  Applied  to  a  Latch 

2.2.3.2 Delay Insensitive Methodology

A delay element is not suitable for functional blocks with widely varying
completion times, since the benefit of an average delay throughput is not realized.
Additional logic can be added to this type of block to detect when its execution is
complete [3].

VLSI  synthesis  tools  do  not  have  the  capability  to  generate  the  completion 
detection  circuit  for  a  particular  functional  block.  For  example,  they  do  not  know  how  to 
synthesize  the  completion  detection  circuit  for  an  adder.  Figure  2-3  shows  a  typical  one- 
bit  adder  without  completion  detection. 

A dual-rail adder scheme similar to the Manchester adder can be used to
implement completion detection, as shown in Figure 2-4 [15]. The dual-rail adder works
on the principle that each stage will have either a carry out (COUT) or no carry out
(NOCOUT) condition based on the inputs to the stage. Adding 0 and 0 will never result
in a carry out, even if there is a carry in. Likewise, adding 1 and 1 will always result in a
carry out, even if there is a carry in of 0. Therefore, the carry condition in these cases can
be determined by the data to be summed alone and gives an early completion detection.
Adding a 0 and 1 or 1 and 0 may or may not have a carry out depending on the carry in
condition. In this case, the stage must wait for either a carry in (CIN) or no carry in
(NOCIN) value. The end result is that the completion detection circuit simply becomes the
NOR of the COUT and NOCOUT values. Whenever one of these conditions exists, it
indicates that all input values necessary for evaluating the sum are present and DONE is
asserted.
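
The carry rules described above can be illustrated with a small behavioral model. The following Python sketch is not taken from the thesis design; the function names and the step-by-step evaluation loop are assumptions made for illustration. It models DONE as active high (the OR of the two rails), which carries the same information as the NOR described in the text with the opposite polarity.

```python
def stage_carry(a, b, cin, nocin):
    """Return the (COUT, NOCOUT) rails of one dual-rail adder stage.

    cin and nocin are the two rails of the incoming carry. Both are 0
    while the incoming carry is still unknown.
    """
    if a == 1 and b == 1:
        return 1, 0          # 1 + 1 always produces a carry out
    if a == 0 and b == 0:
        return 0, 1          # 0 + 0 never produces a carry out
    return cin, nocin        # 0 + 1 or 1 + 0 must wait for the carry in


def stage_done(cout, nocout):
    """A stage is complete once one of its two output rails is asserted."""
    return (cout | nocout) == 1


def ripple_steps(a_bits, b_bits):
    """Count evaluation steps until every stage is done (bits are LSB first).

    Assumes there is no external carry in, so stage 0 sees NOCIN asserted.
    """
    n = len(a_bits)
    rails = [(0, 0)] * n
    steps = 0
    while not all(stage_done(c, nc) for c, nc in rails):
        prev = list(rails)
        for i in range(n):
            cin, nocin = (0, 1) if i == 0 else prev[i - 1]
            rails[i] = stage_carry(a_bits[i], b_bits[i], cin, nocin)
        steps += 1
    return steps, rails


if __name__ == "__main__":
    print(ripple_steps([1, 1, 1, 1], [1, 1, 1, 1])[0])  # every stage generates
    print(ripple_steps([1, 0, 0, 0], [1, 1, 1, 1])[0])  # carry ripples through
```

In this model, adding 1111 and 1111 completes in a single step because every stage decides its carry from its own data, while adding 0001 and 1111 needs one step per stage because each upper stage waits for its carry in. This is the data-dependent completion time that a fixed delay element cannot exploit.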

Figure 2-3. One-Bit Adder without Completion Detection


Figure  2-4.  One-bit  Adder  with  Completion  Detection 


2.2.3.3 Burst Mode Methodology

The  burst  mode  design  methodology  is  used  to  design  asynchronous  controllers  or 
finite  state  machines.  Synchronous  finite  state  machines  are  easily  synthesized  by  using 
latches,  flip-flops  and  clock  circuitry.  Asynchronous  controllers  or  AFSMs  must  be 
synthesized using a specialized design tool. Unfortunately, no commercial tools exist
with  this  capability  [3]. 

AFSMs can be designed by hand, but this is a very tedious and error-prone process.
Tools  have  been  developed  by  universities  and  corporate  laboratories  that  automatically 
synthesize  AFSMs.  The  Most  Excellent  Asynchronous  Tool  (MEAT)  is  an  early  example 
of  such  a  tool  [16].  A  more  recently  developed  tool  called  3D  took  the  basic  principles  of 
MEAT  and  further  refined  them.  The  3D  tool  was  used  in  the  design  phase  of  this  effort 
because  it  is  kept  up  to  date  and  maintained  by  UC  San  Diego  [17]. 

A  state  table  of  entry  and  exit  conditions  for  the  state  machine  is  provided  to  3D 
by  the  user.  An  example  state  table  is  shown  in  Table  2-1  for  a  Johnson  counter 
(00→01→11→10). 3D converts the state table to positive logic equations. These
equations  are  then  manually  converted  into  behavioral  VHDL.  The  Synopsys  Design 
Analyzer  (with  structuring  and  Boolean  optimization  disabled)  is  used  to  convert  the 
positive  logic  behavioral  VHDL  into  negative  logic  structural  VHDL.  After  the 
structural  VHDL  is  generated,  reset  circuitry  and  corrections  for  fanout  are  added 
manually  to  the  controller  circuit.  The  final  two-bit  Johnson  counter  circuit  is  shown  in 
Figure  2-5,  which  includes  the  reset  circuit.  Once  the  structural  VHDL  is  complete,  it  is 
tested  using  a  VHDL  simulator. 

Depending  on  the  complexity  of  the  AFSM,  3D  may  not  be  able  to  synthesize  the 
controller.  The  controller  must  then  be  broken  down  using  Shannon  decomposition  and 
resynthesized.  Once  an  AFSM  has  been  synthesized  and  validated  in  VHDL,  it  can  be 
used  in  a  physical  layout. 

Table 2-1. 3D State Table of a Two-Bit Johnson Counter

Present State   Next State   Entry Conditions   Exit Conditions
0               1            COUNT+
1               2            COUNT-             BIT0+
2               3            COUNT+
3               4            COUNT-             BIT1+
4               5            COUNT+
5               6            COUNT-             BIT0-
6               7            COUNT+
7               0            COUNT-             BIT1-

Figure  2-5.  Gate  Level  Schematic  of  a  Synthesized  Two-Bit  Johnson  Counter 


The  two-bit  Johnson  counter  example  is  used  to  illustrate  how  asynchronous 
synthesis  tools  work,  but  it  highlights  how  automated  AFSM  synthesis  does  not  always 
produce  the  optimal  solution  [9].  A  better  implementation  of  the  two-bit  Johnson  counter 
is  accomplished  by  using  two  D-registers,  as  shown  in  Figure  2-6.  This  type  of  counter  is 
used throughout the design. The Johnson counter was selected because it
changes only one bit each clock cycle, thus avoiding possible data hazards.


Figure  2-6.  Improved  Two-Bit  Johnson  Counter 


2.2.3.4 Speed Independent Methodology

Functional  blocks  in  an  asynchronous  design  must  have  a  standard  handshaking 
protocol  in  order  to  be  compatible  with  other  blocks.  A  generic  functional  block  in  an 
asynchronous  design  is  shown  in  Figure  2-7.  The  input  and  output  signal  names  shown 
here  are  used  throughout  this  research  effort.  The  REQIN  signal  represents  the  external 
request  to  the  block  to  input  new  data.  The  ACKIN  signal  is  asserted  when  the  new  input 
data  is  fully  latched  or  accepted.  The  REQOUT  signal  represents  the  request  of  the 
functional  block  to  send  processed  data  out.  The  ACKOUT  signal  is  the  external 
acknowledgement  from  the  next  block  that  the  processed  data  was  latched  or  accepted. 

[Figure: FUNCTIONAL BLOCK with the REQIN and ACKIN handshake signals on its input side and the REQOUT and ACKOUT handshake signals on its output side]

Figure 2-7. Asynchronous Functional Block


The  speed  independent  methodology  describes  two  standards  for  handshaking 
between  connecting  blocks.  It  does  not  assume  any  pre-defined  delays  but  relies  on  a  set 
of  handshaking  signals  between  the  blocks.  The  two-phase  model  is  illustrated  in  Figure 
2-8.  It  is  a  scheme  that  senses  signal  transitions  to  complete  the  handshake  cycle.  The 
first  exchange  is  signaled  by  a  low  to  high  transition  on  REQ  (1).  ACK  (2)  responds  by 
acknowledging  the  request.  The  second  cycle  uses  the  complementary  set  of  transitions 
to  complete  the  cycle  [3]. 


[Figure: timing diagram of the REQ and ACK signals over Cycle 1 and Cycle 2]

Figure 2-8. Two-phase Model


The  four-phase  model  is  illustrated  in  Figure  2-9.  It  has  a  four-cycle  handshake 
for  each  data  exchange.  Although  the  four-phase  model  appears  to  be  more  difficult  to 
implement,  its  detection  circuit  is  actually  smaller  than  the  two-phase  model  [3].  The 
four-phase  model  is  the  primary  interface  standard  used  throughout  this  design. 
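
The difference between the two handshaking models can be made concrete by listing the signal events each one produces. The Python sketch below is illustrative only. The event ordering used for the four-phase model (REQ high, ACK high, REQ low, ACK low) is the conventional return-to-zero sequence and is an assumption here, since it is not spelled out in the text; the function names are also hypothetical.

```python
def two_phase(n_items):
    """Events for n_items transfers where every transition carries meaning."""
    events, req, ack = [], 0, 0
    for _ in range(n_items):
        req ^= 1                      # a transition on REQ starts a transfer
        events.append(("REQ", req))
        ack ^= 1                      # the matching transition on ACK ends it
        events.append(("ACK", ack))
    return events


def four_phase(n_items):
    """Events for n_items transfers with a return-to-zero handshake."""
    events = []
    for _ in range(n_items):
        events += [("REQ", 1), ("ACK", 1), ("REQ", 0), ("ACK", 0)]
    return events


if __name__ == "__main__":
    print(two_phase(2))
    print(four_phase(1))
```

Two transfers in the two-phase model produce the same wire activity as a single transfer in the four-phase model. The four-phase receiver only has to detect signal levels, whereas the two-phase receiver has to detect both rising and falling transitions, which is consistent with the smaller detection circuit noted above.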

2.2.3.5 Asynchronous Design Flow Summary
Figure  2-10  illustrates  the  asynchronous  design  flow  used  in  this  research. 
Comparing  Figure  2-10  with  Figure  2-1  highlights  the  additional  manual  intervention 
necessary  to  arrive  at  a  complete  asynchronous  design  compared  to  a  synchronous 
design. 


Figure  2-10.  Asynchronous  Design  Flow 


Research  using  the  asynchronous  design  methodology  presented  additional 
challenges  over  those  present  in  synchronous  design.  Prototype  design  tools  and  manual 
circuit  design  involving  an  iterative  trial  and  error  process  were  used  to  bridge  the  gaps 
between  the  asynchronous  tools  and  the  VLSI  design  tools. 


2.3 FFT and FASST Theory


Computing  an  FFT  is  a  very  efficient  way  to  digitally  convert  a  time  domain 
signal into a frequency domain signal. A short explanation of the FFT and FASST is
provided in this section.

2.3.1  The  Fourier  Transform 

The  Fourier  Transform  is  a  mathematical  operation  that  is  used  to  convert  data  in 
the  time  domain  to  the  frequency  domain.  The  basic  equation  is  shown  in  Equation  2-1 
where  x(t)  is  the  time  domain  signal,  X(f)  is  the  transformed  frequency  domain 
component,  and  the  exponential  represents  the  Fourier  series  components  [18]. 

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \tag{2-1}$$

To  solve  the  Fourier  Transform  using  a  computer,  the  input  data  and  calculations 
must  be  broken  into  finite  segments  of  a  given  size  and  processed  using  a  slightly 
different  formula.  This  modified  formula  is  called  the  Discrete  Fourier  Transform  (DFT) 
and  is  introduced  in  the  next  section. 

2.3.2  The  Discrete  Fourier  Transform 

One can apply numerical integration to Equation 2-1 to create a problem that is
more amenable to implementation on a computer as a trapezoidal formula since
$x(t_0) = x(t_N)$, as shown in Equation 2-2.

$$X(f_k) = \sum_{i=0}^{N-1} x(t_i)\, e^{-j 2\pi f_k t_i}\, \delta t, \quad k = 0, \ldots, N-1 \tag{2-2}$$

Typically, one will analyze a problem where there are N data points of the
function, which are assumed to be represented by no more than N different sinusoids.
Equation 2-2 reveals a complexity of N where each operation requires a multiplication.
Thus we can approximate the continuous Fourier transform using a discrete
representation of the transform by letting $t_n = n\,\delta t$ and $f_m = m\,\delta f$ where
$1/\delta f = N\,\delta t$. With these substitutions, the discrete Fourier transform can be
represented as in Equation 2-3.

$$X(m) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi n m}{N}} \tag{2-3}$$

N represents the number of samples of the finite sequence (which is commonly
referred to as the point size of a DFT), x(n) represents the time domain values of the
sequence, and X(m) represents the frequency domain components of the Fourier Transform
of x(n). Each of the N output values requires N complex multiply operations, which
translates to 4N real multiply operations on a computer (this makes a total of N × 4N or
$4N^2$ operations, which is expressed as $O(N^2)$). By taking advantage of complex conjugate
symmetry, the DFT can be solved in fewer than $O(N^2)$ operations. This concept is the
premise of the Fast Fourier Transform (FFT) [18].
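
The operation count of the direct method can be confirmed by evaluating the DFT sum literally. The following Python sketch is an illustration written for this section, not code from the thesis; it evaluates the sum of Equation 2-3 with two nested loops and counts one complex multiplication per term.

```python
import cmath


def dft(x):
    """Directly evaluate the DFT sum; return (X, complex_multiplies)."""
    n_points = len(x)
    result, multiplies = [], 0
    for m in range(n_points):            # one pass per output value X(m)
        acc = 0j
        for n in range(n_points):        # N terms per output value
            acc += x[n] * cmath.exp(-2j * cmath.pi * n * m / n_points)
            multiplies += 1
        result.append(acc)
    return result, multiplies


if __name__ == "__main__":
    spectrum, count = dft([1, 0, 0, 0])
    print(spectrum)   # a unit impulse has a flat spectrum
    print(count)      # N * N complex multiplications
```

For a 4-point input the loops perform 16 complex multiplications, and the count grows as the square of the point size.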

2.3.3  The  Fast  Fourier  Transform 

The FFT was developed in the 1960s when signal processing was becoming an
interesting research tool. Limited computational resources sometimes prohibited using
the DFT to evaluate large point sizes. An algorithm that simplified the computation of
the DFT was developed and was later called the Fast Fourier Transform [19]. This
transform is briefly described in this section.

The first step in simplifying the DFT is to use the substitution $W_N = e^{-j 2\pi / N}$. The
DFT can be expressed as shown in Equation 2-4.

$$X(m) = \sum_{n=0}^{N-1} x(n)\, W_N^{nm} \tag{2-4}$$

By taking advantage of the complex conjugate symmetry of $W_N$, the number of
overall computations is reduced from $O(N^2)$ down to $O(N \log_2 N)$, which classifies the
algorithm as a “fast” Fourier Transform or FFT. This property is extremely advantageous:
reducing the operation count before an algorithm is implemented in software or hardware
yields a substantial performance increase. The comparison of the number of
multiplication operations required by the FFT and DFT is shown in Figure 2-11
[18].
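
To give a feel for the size of this reduction, the sketch below tabulates the two growth terms for a few point sizes. It is a hypothetical helper written for illustration; it compares orders of growth only ($N^2$ against $N \log_2 N$) and ignores constant factors, so the values are not exact multiplication counts for any particular FFT implementation.

```python
def growth_table(point_sizes):
    """Return (N, N**2, N*log2(N)) rows for power-of-two point sizes."""
    rows = []
    for n in point_sizes:
        if n < 1 or n & (n - 1):
            raise ValueError("point sizes must be powers of two")
        log2_n = n.bit_length() - 1      # exact log2 for a power of two
        rows.append((n, n * n, n * log2_n))
    return rows


if __name__ == "__main__":
    for n, dft_ops, fft_ops in growth_table([4, 16, 64, 256, 1024]):
        print(f"N={n:5d}  N^2={dft_ops:8d}  N*log2(N)={fft_ops:6d}")
```

At the FFT-16 point size the two terms differ by a factor of 4, and at a 1024-point size by a factor of roughly 100.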


[Plot: DFT vs. FFT Computational Demand as a function of Point Size]

Figure 2-11. FFT vs. DFT Computational Comparison

2.3.4 The Suter and Stevens Fast Fourier Transform Architecture

The  Suter  and  Stevens  FFT  architecture  is  pipelined,  extremely  local  and 
eliminates  the  need  for  shared  memory.  It  utilizes  a  small  number  of  logic  blocks  that  are 
replicated  throughout  the  architecture  and  operate  in  parallel  [20]. 

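A simple derivation is given here of the architecture. Referring to Equation 2-4,
the substitution $N = N_1 N_2$ is made. Using the division theorem for integers,
$m = m_2 N_1 + m_1$ and $n = n_1 N_2 + n_2$ where $m_1, n_1 = 0, 1, \ldots, N_1 - 1$ and
$m_2, n_2 = 0, 1, \ldots, N_2 - 1$. The polyphase components [21] of $x(n)$ are defined as
$x_k(n) = x(Mn + k)$, $k = 0, \ldots, M-1$. The FFT can be broken into interdependent
equivalent classes of calculations, or similar types of calculations, by letting
$X_{m_1}(m_2) = X(m_2 N_1 + m_1)$ and $x_{n_2}(n_1) = x(n_1 N_2 + n_2)$. This
polyphase notation is used to represent Equation 2-3 as Equation 2-5

$$X_{m_1}(m_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, e^{-j \frac{2\pi (m_2 N_1 + m_1)(n_1 N_2 + n_2)}{N_1 N_2}} \tag{2-5}$$

which is equivalent to Equation 2-6.

$$X_{m_1}(m_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, e^{-j \frac{2\pi m_2 N_1 n_1 N_2}{N_1 N_2}}\, e^{-j \frac{2\pi m_2 N_1 n_2}{N_1 N_2}}\, e^{-j \frac{2\pi m_1 n_1 N_2}{N_1 N_2}}\, e^{-j \frac{2\pi m_1 n_2}{N_1 N_2}} \tag{2-6}$$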

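Equation 2-7 is the result of factoring and simplifying the first exponential term to
unity.

$$X_{m_1}(m_2) = \sum_{n_2=0}^{N_2-1} \left[ \left( \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, e^{-j \frac{2\pi m_1 n_1}{N_1}} \right) e^{-j \frac{2\pi m_1 n_2}{N}} \right] e^{-j \frac{2\pi m_2 n_2}{N_2}} \tag{2-7}$$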


Using the $W_N$ substitution, Equation 2-8 mathematically describes the FASST
architecture.

$$X_{m_1}(m_2) = \sum_{n_2=0}^{N_2-1} \left[ \left( \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, W_{N_1}^{m_1 n_1} \right) W_N^{m_1 n_2} \right] W_{N_2}^{m_2 n_2} \tag{2-8}$$


Finally,  Equation  2-8  is  expressed  in  hardware  by  Figure  2-12. 


Figure  2-12.  FASST  Generic  Point  Size  Block  Diagram  [20] 


Figure 2-12 implies that there are several basic components of the architecture.
The $\downarrow N_2$ blocks are decimators, which break down the N input values into $N_1$
concurrent streams at a frequency of $1/N_1$. The $N_1$ and $N_2$ FFT blocks are the
foundational FFT components. The ⊗ components are the complex multipliers. The $N_1$
constants store the appropriate constant values depending on the point size. The large
boxes are routing areas for post-multiplied values before they are processed by the $N_2$
FFT. The final component is an expander, which is shown as $\uparrow N_1$ in the figure. The
expanders compose the output values of the $N_2$ FFT into the correct output order and
reduce the frequency by a factor of $N_2$. An overview of these components is provided in
Chapter Three and a detailed presentation of how they were implemented is presented in
Chapter Four.


2.4  Radiation  Hardening  of  Electronics 

The  term  radiation  hardening  originated  from  a  class  of  military  electronics  that 
had  to  operate  through  and  survive  the  most  severe  radiation  environments.  Although 
many  applications  require  protection  from  radiation,  they  do  not  require  the  highest  level 
of  protection  attainable.  Cost  factors  dictate  that  application  specific  electronics  only  be 
radiation  tolerant.  A  variety  of  methods  can  be  used  to  protect  a  circuit  from  the  effects 
of  radiation  and  several  of  them  are  described  in  this  section  [22]. 

The  need  has  been  demonstrated  for  a  high  performance  FFT  processor  designed 
for  the  space  environment  and  has  driven  the  requirement  to  make  the  product  of  this 
research  radiation  tolerant.  The  space  environment  introduces  many  hazards  not  present 
on  earth.  Additional  design  measures  must  be  taken  to  prevent  these  hazards  from 
impacting  the  operation  of  this  design. 

2.4.1 The Need for Radiation Hardening

The  United  States  Air  Force  has  an  interest  in  circuits  that  are  able  to  perform  in  a 
radiation  environment.  Circuits  used  in  space  must  have  some  degree  of  radiation 
tolerance. This effort explores two categories of radiation exposure: long-term total
ionizing dose and short-lived single event effects [22].

2.4.1.1 Total Ionizing Dose

Complementary  metal  oxide  semiconductor  (CMOS)  circuits  account  for  a 
significant  majority  of  the  world’s  electronic  circuits  [5].  They  degrade  in  a  radiation 
environment  due  to  the  total  accumulated  dose  of  radiation.  This  degradation  is  seen  as  a 
negative  shift  in  the  transistor  threshold  voltage  and  decrease  in  gain.  With  enough 
voltage  threshold  shift,  the  circuit  will  start  consuming  power  even  when  not  switching. 
The  decrease  in  gain  causes  the  transistors  to  become  harder  to  switch.  After  extended 
exposure  to  radiation,  the  circuit  will  cease  to  function  [23]. 

The  main  source  of  degradation  comes  from  the  interaction  of  ionizing  radiation 
with the gate and field oxides (SiO2) in the device structure. The gate oxide is a thin
high-quality  oxide  used  to  insulate  the  gate  contact  from  the  transistor  channel.  The  field 
oxide  is  a  thick  low-quality  oxide  used  to  isolate  metal  traces  from  one  another  [22]. 

Ionizing  radiation  causes  the  formation  of  electron-hole  pairs  in  the  gate  oxide. 
Electrons have a much higher mobility than holes in SiO2 and are attracted to and swept
out of the gate in an nMOS transistor. The holes become trapped and migrate toward the
transistor  channel.  This  results  in  the  eventual  buildup  of  positive  charge  above  the 
transistor  channel  and  acts  like  the  charge  that  is  present  when  voltage  is  applied  at  the 
gate.  As  more  charge  is  trapped,  the  voltage  threshold  of  the  nMOS  transistor  becomes 
more  negative,  which  means  it  becomes  easier  to  turn  on.  With  enough  shift  in  threshold 
voltage,  the  transistor  will  be  turned  on  without  a  gate  voltage  applied.  Conversely,  a 
pMOS  transistor  becomes  more  difficult  to  turn  on.  Figure  2-13  shows  how  the  gate 
voltage  versus  drain  current  curve  changes  resulting  from  exposure  to  radiation  in  a 
nMOS  transistor  [23]. 

Figure 2-13. I-V nMOS Curve (drain current versus gate voltage) [23]


The  field  oxide  also  traps  charge  due  to  ionizing  radiation.  The  trapped  positive 
charge  along  the  edges  of  the  nMOS  transistor  creates  a  leakage  channel.  Leakage  paths 
can  also  form  between  transistors  through  the  field  oxide.  This  constant  leakage 
contributes  to  increased  power  consumption  [22]. 

Figure  2-14  illustrates  how  a  circuit  exposed  to  a  radiation  environment  slowly 
increases  power  consumption  and  reduces  the  operating  frequency.  Eventually,  the 
circuit  will  cease  functioning  when  the  power  required  by  the  degraded  electronics 
exceeds  the  output  capability  of  the  power  supply.  Premature  failure  can  also  occur  when 
the  output  voltage  swing  of  the  transistors  becomes  insufficient  to  drive  successive  stages 
or  when  the  timing  is  degraded  to  the  point  where  the  circuit  does  not  operate  properly 
[23]. An important thing to note about an asynchronous space design is that it will
automatically adjust its operating frequency, which can potentially extend the life of the
circuit.

Figure 2-14. Total Dose Effects [23]


2.4.1.2 Single Event Effects

When  a  high-energy  particle  passes  through  a  circuit  and  causes  a  disruption  in 
circuit  operation,  it  is  classified  as  a  single  event  effect  (SEE).  For  example,  a  proton  or 
ion  passing  through  a  latch  could  change  the  value  of  a  stored  bit.  This  event  is  called  a 
single  event  upset  (SEU)  [24]. 

Protons and high-energy heavy ions typically cause SEUs. Space vehicles
passing through the South Atlantic anomaly, where there is a high concentration of
protons, can experience SEU activity in that region. These particles temporarily create
an abundance of free carriers in the transistor channel region. The free
carriers in effect turn the channel on [24].

If a channel is turned on in a combinational logic circuit, the effect is seen as a
spike in the output data and usually does not affect system operation. However, if a
channel is turned on that is part of a memory structure, such as a latch, it can upset the
state of the latch. Upset can only occur if enough carriers are present in the transistor
channel to turn it on strongly enough to change the state of the latch. SEU can be
corrected by refreshing memory locations on a periodic basis [24].

Another effect seen in CMOS is single event latchup (SEL). SEL describes the
phenomenon that occurs when inactive parasitic transistor regions (pnpn structure) are
turned on by a high-energy particle. These pnpn regions are formed in CMOS layouts
due to the close placement of nMOS and pMOS transistors and have the characteristics of
a silicon controlled rectifier (SCR). If a particle with enough energy passes through the
controlling pn junction of the SCR, it can switch the SCR on. The only way to turn the
SCR off is with a power cycle [24].

2.4.2  Methods  of  Radiation  Hardening 

Radiation  hardening  was  first  used  by  the  military  to  ensure  that  critical  systems 
would  be  operative  during  a  nuclear  war  [22].  However,  this  technology  has  become 
applicable  to  commercial  systems  as  more  satellites  are  being  placed  in  orbits  with 
elevated radiation levels. There are many ways of achieving radiation hardness. Each
can be used individually or combined to achieve the highest level of radiation tolerance.
Shielding, fabrication process, and design layout techniques are typical methods used to
achieve radiation hardness [22].

2.4.2.1 Radiation Hardening through Shielding

The  most  intuitive  way  to  protect  a  circuit  is  by  enclosing  it  in  a  metal  box  thick 
enough  to  shield  against  all  radiation.  This  is  impractical,  especially  if  it  is  to  be  used  on 
a  spacecraft  where  weight  is  a  concern.  Shielding  usually  involves  surrounding 
electronics  with  200-300  mils  of  aluminum,  which  is  usually  provided  by  the  satellite 
body.  This  shielding  will  block  out  low-energy  particles,  but  does  little  to  stop  high- 
energy  particles.  Additional  protective  measures  to  ensure  radiation  tolerance  are 
discussed  in  the  next  sections  [24]. 

2.4.2.2  Radiation  Hardening  through  Fabrication 

The  most  sophisticated  but  costly  way  to  harden  a  circuit  is  to  alter  the  fabrication 
process. The thickness and growth method of the gate oxide are altered. Thinner gate
oxides  are  more  resistant  to  total  ionizing  dose.  High  quality  oxides  also  increase  total 
ionizing  dose  resistance.  Both  of  these  methods  work  by  reducing  the  amount  of  charge 
trapped  in  the  oxide.  These  methods  drive  up  the  cost,  as  thin  oxides  require  precise 
controls  and  high  quality  oxides  require  more  time  to  grow  [22]. 

Another  fabrication  technique  used  to  increase  radiation  hardness  is  to  grow  the 
transistor  structures  on  a  high  quality  insulating  material.  This  method  reduces  the 
frequency  of  single  event  effects.  By  growing  devices  on  an  insulator,  the  parasitic 
transistor  region  is  eliminated,  thus  preventing  SEL.  SEU  is  also  reduced,  as  line  charge 
formation  from  an  ion  strike  is  minimized  [24]. 

2.4.2.3 Radiation Hardening through Layout

Another  method  for  increasing  radiation  tolerance  of  a  VLSI  circuit  is  to  change 
the  design  rules  by  which  the  circuit  is  laid  out.  Usually  this  results  in  a  larger  area  due 
to  high  drive  strength  devices,  implying  a  loss  of  power  efficiency  [22]. 

Radiation  hardening  through  layout  is  the  only  viable  method  of  producing  a 
radiation  tolerant  design  for  a  thesis,  as  the  design  can  be  fabricated  on  an  inexpensive 
commercial  process  line.  The  only  additional  cost  comes  from  the  increased  area 
requirement.  The  gate  and  pad  library  designed  jointly  by  AFRL  and  MRC  used  in  this 
effort  achieves  maximum  radiation  tolerance  from  a  commercially  fabricated  circuit.  The 
library is designed for fabrication using the HP 0.5 µm process [4].

Figure 2-15. Radiation Tolerant Layout of an Inverter [4]


Radiation tolerance to total ionizing dose and single event effects is achieved
through layout. A radiation tolerant inverter is shown in Figure 2-15. Total ionizing
dose effects are minimized by the use of annular geometry nMOS transistors. This
geometry minimizes the shift in $V_t$ by preventing the buildup of trapped charge near the
active region. The transistors are surrounded with highly doped guard rings, which
prevent leakage through the field oxide separating the transistors and reduce SEL. High
drive strength transistors reduce SEU and lengthen the life cycle at the cost of some
efficiency [4]. A benefit of using the HP 0.5 µm process is that it demonstrates a higher
tolerance to total ionizing dose than other similar processes. Due to the smaller feature
size, its SEU tolerance is lower than that of larger technologies [25].


2.5  FFT  Comparison 

There  are  numerous  FFT  solutions  from  single  board  computers  to  application 
specific  integrated  circuits.  However,  there  is  no  single  chip  that  has  all  the  qualities 
discussed in this chapter, except for the previous thesis effort. This section presents data
found on FFT processors designed for space, low power and speed, followed by a short
summary of the previous thesis effort.

2.5.1  General  Purpose  FFT  Processors 

The  most  common  method  of  computing  an  FFT  in  space  is  through  a  general 
purpose  radiation  hardened  microprocessor.  One  such  processor  is  the  RAD  6000,  which 
is  the  IBM  R/S  6000  processor  fabricated  by  Lockheed  Martin’s  radiation  hardened 
fabrication  process  [26]. 

General-purpose  machines  are  a  poor  example  for  comparison  of  power  versus 
speed,  as  they  are  generally  not  optimized  for  FFT  calculations  and  use  a  significant 
amount  of  power.  For  example,  the  RAD  6000  in  a  low  power  configuration  consumes 
2.5  Watts  when  running  at  2.5  MHz,  which  translates  to  about  2  MIPS  [26]. 

A radiation hardened processor optimized for Digital Signal Processing (DSP) is a
better  comparison.  Texas  Instruments  has  fabricated  a  radiation-hardened  version  of  their 
C40  DSP  chip.  Specifications  on  the  commercial  version  are  published.  The  assumption 
should  be  made  that  the  space  bound  version  has  a  lower  performance  than  the 
commercial  version  [27]. 

2.5.2 Low-Power FFT Processors

An outstanding research effort in low power FFT processors is ongoing at
Stanford University. An FFT-1024 processor, called Spiffee, was designed and
fabricated with low power in mind [28]. It has a high efficiency of 24.7 nJ/Unit
Transform. The design is based on a clocked radix-2 decimation in time form of the FFT
with a core butterfly processor and uses low-$V_t$ transistors. The core of the chip operates
at a lower voltage than the I/O circuitry. This design is very sensitive to radiation and
noise and is not suitable for space application.

2.5.3  High-Performance  Processors 

A processor designed for speed alone, called COBRA, was investigated to get an
idea of the high-end computational speed for computing a 1024-point FFT [29].
The COBRA architecture implements a clocked radix-4 decimation in time form of the
FFT. It uses multiple butterfly processors connected through a switch matrix. A single
COBRA chip can only execute an FFT-64. Sixteen chips are used in parallel to compute
an FFT-1024. This design is also not suitable for space, due to the custom multiple chip
implementation of the FFT-1024.

2.5.4 Previous Thesis Research


The  previous  thesis  research  developed  the  first  implementation  of  the  FASST 
architecture  [2].  The  design  style  used  was  different  than  the  one  that  is  presented  in  this 
thesis.  An  effort  was  made  to  produce  a  small  implementation,  which  used  core  ALUs 
connected  to  memory  banks  and  coordinated  with  one-hot  style  controllers  [3].  A  6-bit 
FFT-4 test chip was fabricated; however, the chip had design errors that
prevented the determination of any performance data. Simulation data was available
detailing  the  performance  of  the  FFT-16  and  extrapolations  to  an  FFT-1024. 


2.5.5  Performance  Comparison 

Table 2-2 gives a comparison of the FFT processors discussed in the previous
sections.  This  data  was  obtained  from  an  FFT  comparison  Internet  site  maintained  by 
Stanford  University  [30].  Background  papers  were  checked  to  verify  the  published  data. 
The  data  necessary  to  verify  the  Spiffee  throughput  time  and  the  COBRA  efficiency  was 
not  available.  These  two  results  should  be  considered  unreliable  estimates  [31]. 

Table 2-2. A Comparison of FFT Processors

Processor  Design Feature                   Dataword Size          Supply       FFT-1024 Throughput  Efficiency
Name                                        & Type                 Voltage (V)  Time (µs)            (nJ/Unit Transform)
---------  -------------------------------  ---------------------  -----------  -------------------  -------------------
C40        Space                            32-bit Floating Point  5.0          1298                 5704
Spiffee    Low Power                        20-bit Fixed Point     3.3          30                   24.7
COBRA      Speed                            23-bit Fixed Point     5.0          9.5                  71.4
FASST      Asynchronous, space & low power  16-bit Fixed Point     3.3          10                   120

2.6 Conclusion


The  goal  of  this  thesis  is  to  design  a  fast,  efficient  FFT  processor  suitable  for 
space.  This  chapter  covered  the  basics  of  asynchronous  design,  FFT  theory  and  radiation 
hardening  of  electronics  to  support  the  goal  of  the  thesis.  The  final  section  compared 
FFT  processors  that  are  the  best  in  their  category  with  the  results  from  a  previous  thesis. 
The  results  of  the  previous  thesis  demonstrated  that  the  efficiency  and  performance 
achieved using the FASST architecture is exactly what was desired. The estimated
FFT-1024 performance and efficiency show an improvement of two orders of magnitude.
This  comparison  motivates  the  development  of  an  improved  functional  FFT  using  the 
FASST  architecture. 

3. Design Overview


The  goal  of  this  design  is  to  implement  a  functional  FASST  design  in  silicon.  A 
16-point  FFT  was  chosen  as  the  base-case  to  prove  that  the  FASST  architecture  works 
and  to  measure  the  performance  of  this  implementation.  It  also  utilizes  all  of  the 
necessary  components  which  are  used  in  arbitrarily  large  FFTs. 

This  chapter  discusses  what  the  design  constraints  are  and  how  they  impacted  the 
VLSI  design.  The  functionality  of  the  top-level  design  is  described,  as  well  as  the 
functionality  of  the  major  components.  Chapter  Four  revisits  these  components  in  greater 
detail  down  to  the  gate  level. 


3.1  Design  Constraints 

This  section  covers  design  constraints  imposed  on  the  selection  of  the  cell  library, 
data type and point size. The thesis sponsor recommended these constraints.

3.1.1  Cell  Library 

The  use  of  a  radiation  tolerant  VLSI  cell  library  was  the  method  chosen  to  meet 
the  space  application  requirement.  The  radiation  tolerant  library  cells  provided  by  AFRL 
were designed for fabrication on the commercial HP 0.5 µm foundry process line [4].

The  size  of  the  radiation  tolerant  cells  is  a  notable  feature.  Consider  the  example 
of  the  difference  in  size  between  a  radiation  tolerant  minimum-sized  inverter  and  a 
standard minimum-sized inverter in the Lager IV distribution [13]. A comparison is
illustrated in Figure 3-1. The Lager inverter measures 68λ × 16λ, while the MRC cell
measures 124λ × 40λ [4].


Figure  3-1.  Inverter  Size  Comparison  [13] [4] 

Several design features were engineered into the radiation tolerant library to
enable the cells to be densely packed. The cells are symmetrical and the rings around the
cells are allowed to overlap the rings of the nearest neighbors. All of the routing within
the cells is at the lowest level metal, which allows routing of multiple traces of higher
level metals directly over the cells. These design features limit the overall design size to
roughly twice that of a minimum sized design, provided that a high performance
router is used [4].

The  cells  are  not  optimally  power  efficient  because  they  are  not  minimum  sized. 
The  choice  of  using  the  asynchronous  design  approach  was  not  only  to  achieve  a  faster 
design,  but  also  to  compensate  for  the  additional  power  consumption  of  the  radiation 
tolerant  library.  The  HSPICE  [32]  plot  of  power  versus  time  in  Figure  3-2  indicates  that  a 
single  radiation  tolerant  inverter  (dashed  line)  uses  more  power  (area  under  the  curve) 
than  a  minimum  sized  inverter  (solid  line)  by  a  factor  of  two.  This  additional  power 
consumption  is  necessary  to  overcome  the  SEU  effects  in  memory  structures  discussed  in 
Chapter  Two. 


Figure  3-2.  Inverter  Power  Use  Comparison 


3.1.2  Data  Type  and  Size 

The  data  types  used  in  a  DSP  processor  are  integer,  fixed  point,  block-float  or 
floating-point. The size and type of the data greatly affect the design size and
complexity. Typically, a commercial processor will use a floating-point or block-float
format  to  enable  a  wide  product  application.  If  the  input  and  output  data  are  properly 
scaled,  floating  point  format  is  not  necessary  and  more  efficient  designs  can  be  realized. 
A  16-bit  word  size  for  the  real  and  imaginary  components  was  chosen.  This  data  format 
is  shown  in  Figure  3-3. 


Figure 3-3. 16-bit Data Word Format (bits 15-12: ordinate; bits 11-0: mantissa)


The chosen data format allows twelve bits of resolution in the mantissa. This
implies that the data word can represent values from $1000.000000000000_2$ ($-8.0_{10}$) to
$0111.111111111111_2$ ($7.999755859375_{10}$).
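
As a rough illustration of this word format, the following Python sketch converts between real values and 16-bit two's-complement words with four ordinate bits and twelve mantissa bits. The function names and the saturating behavior on overflow are assumptions made for illustration; they are not taken from the design.

```python
FRACTION_BITS = 12                      # mantissa width from Figure 3-3
WORD_BITS = 16                          # total word width
SCALE = 1 << FRACTION_BITS              # 4096 steps per unit

MIN_RAW = -(1 << (WORD_BITS - 1))       # -32768 represents -8.0
MAX_RAW = (1 << (WORD_BITS - 1)) - 1    # 32767 represents 7.999755859375


def to_fixed(value: float) -> int:
    """Quantize a real value to a 16-bit word (hypothetical helper, saturating)."""
    raw = round(value * SCALE)
    raw = max(MIN_RAW, min(MAX_RAW, raw))   # clamp instead of wrapping around
    return raw & 0xFFFF                     # two's-complement bit pattern


def from_fixed(word: int) -> float:
    """Interpret a 16-bit two's-complement word as a real value."""
    raw = word - (1 << WORD_BITS) if word & 0x8000 else word
    return raw / SCALE


if __name__ == "__main__":
    print(format(to_fixed(-8.0), "016b"), from_fixed(0x8000))
    print(format(0x7FFF, "016b"), from_fixed(0x7FFF))
```

The two extreme words, 0x8000 and 0x7FFF, correspond to the limits of the range quoted above.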

3.1.3 Project Point Size

A  16-point  FFT  (FFT-16)  is  the  minimum  size  used  to  demonstrate  the  FASST 
architecture.  The  basic  building  blocks  for  this  base-case  are  the  4-point  FFT  (FFT-4) 
and  the  complex  multiplier  [20].  The  FFT-4  proved  to  be  an  appropriate  building  block 
for  proving  the  concept  of  this  thesis  design  and  was  fabricated.  The  FFT-16  may  be 
fabricated  in  the  future  by  AFRL  depending  on  the  success  of  the  FFT-4  fabricated  chip. 


3.2  FFT-16  Design 

The FASST architecture can be applied to any point size, provided that $N = N_1 N_2$.
To demonstrate the functionality of the FASST architecture, an FFT-16 ($N = 16$)
implementation with $N_1 = N_2 = 4$ was chosen. This implies that the FFT-4 will comprise
the basic building block of the design.

The  matrix  must  be  used  to  determine  the  constants  used  in  the  FFT-16.  As 
stated previously, for an FFT-16, $N = 16$ and $m_1$, $n_2$ will be 0, 1, 2, 3. The matrix is


Using the $N = 16$ vector constant map in Figure 3-4, the matrix can be represented with
complex  constants  as  shown  in  Equation  3-2. 


Figure  3-4.  FFT-16  Vector  Constant  Map 

Now that all the components are derived and defined, the overall generic block
diagram in Figure 2-12 applied to the FFT-16 (with $N_1 = N_2 = 4$) results in a data flow
block diagram (with simplified constants) as shown in Figure 3-5. It is interesting to note
that other texts have presented this data flow block diagram with no implication made
that it can be implemented in an asynchronous nature, as in the FASST architecture [33].

The data flow block diagram in Figure 3-5 implies that an FFT-16 will require
eight FFT-4 components (0 through 7) and only three complex multipliers because the
outputs of FFT-4 (0) are multiplied by one.
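
The structure described above can be checked numerically with a short Python sketch. It assumes the conventional decomposition in which input FFT-4 ($n_2$) receives x($n_2$), x($n_2$+4), x($n_2$+8), x($n_2$+12), each first-stage output is multiplied by the constant $W_{16}^{m_1 n_2}$, and output FFT-4 (4 + $m_1$) produces X($m_1$), X($m_1$+4), X($m_1$+8), X($m_1$+12). The function names and the use of a direct DFT as a stand-in for the FFT-4 hardware are illustrative assumptions, not the thesis implementation.

```python
import cmath


def dft(x):
    """Direct DFT, used both as the FFT-4 stand-in and as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]


def fft16(x):
    """16-point FFT built from eight 4-point transforms (N1 = N2 = 4)."""
    assert len(x) == 16
    # Input stage: FFT-4 (n2) receives x(n2), x(n2+4), x(n2+8), x(n2+12).
    stage1 = [dft(x[n2::4]) for n2 in range(4)]
    # Constant multiplication by W_16^(m1*n2). The n2 = 0 row is all ones,
    # so FFT-4 (0) needs no multiplier; only three rows are multiplied.
    for n2 in range(1, 4):
        for m1 in range(4):
            stage1[n2][m1] *= cmath.exp(-2j * cmath.pi * m1 * n2 / 16)
    # Output stage: FFT-4 (4 + m1) produces X(m1), X(m1+4), X(m1+8), X(m1+12).
    out = [0j] * 16
    for m1 in range(4):
        column = dft([stage1[n2][m1] for n2 in range(4)])
        for m2 in range(4):
            out[m1 + 4 * m2] = column[m2]
    return out
```

Comparing `fft16(x)` with `dft(x)` for an arbitrary 16-element input is a quick way to confirm that eight 4-point transforms and three rows of constant multiplications are sufficient.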

Additional  components  are  needed  to  route  the  control  signals  and  store  the 
constant  values.  The  decimator  and  expander  are  responsible  for  routing  incoming  data 
and  ordering  outgoing  data,  respectively.  The  crossbar  is  responsible  for  handling  the 
interconnection  between  the  input  and  output  FFT-4  units. 

Sections  3.2.1  through  3.2.6  give  an  overview  of  the  composition  and  function  of 
each  major  component  of  the  FFT-16.  Section  3.2.7  combines  the  components  and 
describes  the  FFT-16  operation  at  the  top  level. 

3.2.1  FFT-4 

The  FFT-4  computes  the  four-point  FFT  of  the  input  sequence.  The  derivation  of 
this  base-case  FFT  is  presented  to  illustrate  how  the  math  correlates  to  a  physical  design. 

The FFT-4 is described by Equation 3-3.

\[ X(m) = W_4\, x(n) \tag{3-3} \]

Equation 3-3 can be expanded into matrix format to show the values of the $W_4$
coefficients. This expansion is shown in Equation 3-4.


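\[
\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix}
=
\begin{bmatrix}
W_4^0 & W_4^0 & W_4^0 & W_4^0 \\
W_4^0 & W_4^1 & W_4^2 & W_4^3 \\
W_4^0 & W_4^2 & W_4^4 & W_4^6 \\
W_4^0 & W_4^3 & W_4^6 & W_4^9
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}
\tag{3-4}
\]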


By using the vector map in Figure 3-6, the constants can be evaluated (i.e.,
$W_4^0 = 1$, $W_4^1 = -j$, etc., where $j = \sqrt{-1}$). The vector map is periodic, which
means, for example, $W_4^4 = W_4^0$. The new expression is shown in Equation 3-5 with the
$W$ values replaced with constants.


Figure 3-6. FFT-4 Vector Constant Map


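\[
\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}
\tag{3-5}
\]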


Separating the matrix into individual equations yields Equations 3-6 through 3-9.

\begin{align}
X(0) &= x(0) + x(1) + x(2) + x(3) \tag{3-6} \\
X(1) &= x(0) - jx(1) - x(2) + jx(3) \tag{3-7} \\
X(2) &= x(0) - x(1) + x(2) - x(3) \tag{3-8} \\
X(3) &= x(0) + jx(1) - x(2) - jx(3) \tag{3-9}
\end{align}

Inspection of these equations reveals that there are twelve complex add operations and
four complex multiplies. The complex multiplies are eliminated by substitution, giving
16 additions or subtractions. By letting $a = x(0) + x(2)$, $b = x(1) + x(3)$, $c = x(0) - x(2)$,
and $d = x(1) - x(3)$, Equations 3-6 through 3-9 are represented by Equations 3-10 through
3-13.


\begin{align}
X(0) &= a + b \tag{3-10} \\
X(1) &= c - jd \tag{3-11} \\
X(2) &= a - b \tag{3-12} \\
X(3) &= c + jd \tag{3-13}
\end{align}


Expressing  the  complex  variables  as  real  and  imaginary  components  yields  Equations 
3-14  through  3-17. 

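\begin{align}
\mathrm{Re}\{X(0)\} + j\,\mathrm{Im}\{X(0)\} &= \mathrm{Re}\{a\} + j\,\mathrm{Im}\{a\} + \mathrm{Re}\{b\} + j\,\mathrm{Im}\{b\} \tag{3-14} \\
\mathrm{Re}\{X(1)\} + j\,\mathrm{Im}\{X(1)\} &= \mathrm{Re}\{c\} + j\,\mathrm{Im}\{c\} - j(\mathrm{Re}\{d\} + j\,\mathrm{Im}\{d\}) \tag{3-15} \\
\mathrm{Re}\{X(2)\} + j\,\mathrm{Im}\{X(2)\} &= \mathrm{Re}\{a\} + j\,\mathrm{Im}\{a\} - (\mathrm{Re}\{b\} + j\,\mathrm{Im}\{b\}) \tag{3-16} \\
\mathrm{Re}\{X(3)\} + j\,\mathrm{Im}\{X(3)\} &= \mathrm{Re}\{c\} + j\,\mathrm{Im}\{c\} + j(\mathrm{Re}\{d\} + j\,\mathrm{Im}\{d\}) \tag{3-17}
\end{align}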

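Finally, expressing Re{X} and Im{X} separately with j factored through, the final
Equations 3-18 through 3-25 are realized.

\begin{align}
\mathrm{Re}\{X(0)\} &= \mathrm{Re}\{a\} + \mathrm{Re}\{b\} \tag{3-18} \\
\mathrm{Im}\{X(0)\} &= \mathrm{Im}\{a\} + \mathrm{Im}\{b\} \tag{3-19} \\
\mathrm{Re}\{X(1)\} &= \mathrm{Re}\{c\} + \mathrm{Im}\{d\} \tag{3-20} \\
\mathrm{Im}\{X(1)\} &= \mathrm{Im}\{c\} - \mathrm{Re}\{d\} \tag{3-21} \\
\mathrm{Re}\{X(2)\} &= \mathrm{Re}\{a\} - \mathrm{Re}\{b\} \tag{3-22} \\
\mathrm{Im}\{X(2)\} &= \mathrm{Im}\{a\} - \mathrm{Im}\{b\} \tag{3-23} \\
\mathrm{Re}\{X(3)\} &= \mathrm{Re}\{c\} - \mathrm{Im}\{d\} \tag{3-24} \\
\mathrm{Im}\{X(3)\} &= \mathrm{Im}\{c\} + \mathrm{Re}\{d\} \tag{3-25}
\end{align}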


The FFT-4 can then be accomplished with only 16 individual add or subtract
operations and no complex multiplies. The absence of complex multiplies makes the FFT-4
an ideal base-case. Figure 3-7 illustrates the final layout of addition and subtraction
operations.
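
The count of 16 real operations can be seen in a minimal Python sketch of the derivation. The representation of each complex value as a (real, imaginary) tuple and the function name `fft4` are assumptions made for illustration; the sketch follows the substitutions a, b, c, d and the final real and imaginary equations, not the hardware layout.

```python
def fft4(x):
    """4-point FFT from 16 real additions/subtractions.

    x is a list of four (re, im) tuples; returns four (re, im) tuples.
    """
    (x0r, x0i), (x1r, x1i), (x2r, x2i), (x3r, x3i) = x
    # Intermediate terms a, b, c, d: eight real operations.
    ar, ai = x0r + x2r, x0i + x2i   # a = x(0) + x(2)
    br, bi = x1r + x3r, x1i + x3i   # b = x(1) + x(3)
    cr, ci = x0r - x2r, x0i - x2i   # c = x(0) - x(2)
    dr, di = x1r - x3r, x1i - x3i   # d = x(1) - x(3)
    # Real and imaginary outputs: eight more real operations.
    return [
        (ar + br, ai + bi),   # X(0) = a + b
        (cr + di, ci - dr),   # X(1) = c - jd
        (ar - br, ai - bi),   # X(2) = a - b
        (cr - di, ci + dr),   # X(3) = c + jd
    ]
```

For example, the real sequence 1, 2, 3, 4 transforms to 10, -2+2j, -2, -2-2j without any multiplication being performed.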


3.2.2  Complex  Multiplier 

The  hardware  implementation  of  a  complex  multiply  must  model  Equation  3-26, 
which  is  the  mathematical  definition  of  the  multiplication  of  two  complex  values  X  and  Y. 
Equation  3-27  gives  the  equivalent  method  for  multiplication  in  hardware  that  has 
separate  real  and  imaginary  data  buses,  as  in  this  design. 

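\[ XY = (\mathrm{Re}\{X\} + j\,\mathrm{Im}\{X\})(\mathrm{Re}\{Y\} + j\,\mathrm{Im}\{Y\}) \tag{3-26} \]

\begin{align}
XY &= (\mathrm{Re}\{X\} + j\,\mathrm{Im}\{X\})(\mathrm{Re}\{Y\} + j\,\mathrm{Im}\{Y\}) \notag \\
   &= \mathrm{Re}\{X\}\mathrm{Re}\{Y\} - \mathrm{Im}\{X\}\mathrm{Im}\{Y\} + j(\mathrm{Re}\{X\}\mathrm{Im}\{Y\} + \mathrm{Re}\{Y\}\mathrm{Im}\{X\}) \tag{3-27}
\end{align}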
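
A minimal Python sketch of Equation 3-27 with separate real and imaginary operands is given here for illustration. The function and argument names are hypothetical, and the sketch ignores the 16-bit fixed-point scaling that a hardware multiplier would have to handle.

```python
def complex_multiply(xr, xi, yr, yi):
    """Multiply X = xr + j*xi by Y = yr + j*yi on separate real/imaginary buses.

    Follows Equation 3-27: four real multiplies, one subtract and one add.
    """
    real = xr * yr - xi * yi    # Re{X}Re{Y} - Im{X}Im{Y}
    imag = xr * yi + yr * xi    # Re{X}Im{Y} + Re{Y}Im{X}
    return real, imag
```

For instance, `complex_multiply(1, 2, 3, 4)` returns the real and imaginary parts of (1 + 2j)(3 + 4j).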

The  design  pursued  in  this  research  uses  asynchronous  building  blocks  to  produce 
a  fully  asynchronous  complex  multiplier.  The  complex  multiplier  block  diagram  is 
shown  in  Figure  3-8. 

Figure 3-8. Complex Multiplier Layout


3.2.3  Decimator 

The decimator is a functional block that takes the ordered input data (x(0) through
x(15)) and routes them to the respective FFT-4 blocks shown in Figure 3-5. The values
x(0), x(4), x(8) and x(12) are sent to the first FFT-4. The values x(1), x(5), x(9) and x(13)
are sent to the second FFT-4. The next two groups of values are sent to the third and
fourth FFT-4 blocks, respectively.

This block is implemented with a two-bit Johnson counter that routes the FFT-16
REQIN and ACKIN signals to the REQIN and ACKIN lines of the first four FFT-4s.
The connections are shown in Figure 3-9. The data is transferred from the input bus to
the FFT-4s through a shared input bus corrected for fanout. This design would require
modification if it were used in a high-speed design.
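
The routing behavior of the decimator can be modeled in a few lines of Python. The sketch below steps a two-bit Johnson (twisted-ring) counter once per input word and uses its state to select the destination FFT-4. The mapping from counter state to FFT-4 index and the function names are assumptions for illustration; the handshake signals themselves are not modeled.

```python
def johnson_states():
    """Yield the repeating states of a two-bit Johnson (twisted-ring) counter."""
    q0, q1 = 0, 0
    while True:
        yield (q0, q1)
        q0, q1 = 1 - q1, q0     # shift, feeding back the inverted last bit


# Assumed mapping from counter state to the selected input FFT-4 block.
STATE_TO_FFT4 = {(0, 0): 0, (1, 0): 1, (1, 1): 2, (0, 1): 3}


def decimate(samples):
    """Route ordered samples x(0)..x(15) to FFT-4 blocks 0..3 in turn."""
    blocks = [[] for _ in range(4)]
    for sample, state in zip(samples, johnson_states()):
        blocks[STATE_TO_FFT4[state]].append(sample)
    return blocks
```

Calling `decimate(list(range(16)))` groups the sample indices as 0, 4, 8, 12 for the first FFT-4, then 1, 5, 9, 13 for the second, and so on, matching the routing described above.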

Figure 3-9. Decimator Control Signals (REQIN, ACKIN; REQIN0-REQIN3, ACKIN0-ACKIN3)


3.2.4  Expander 

The  expander  performs  the  exact  opposite  operation  of  the  decimator  described  in 
the  previous  section.  It  takes  the  output  stream  from  the  final  set  of  FFT-4  blocks  and 
orders the values so they appear as X(0) through X(15) on the output. When X(0) becomes
available  from  FFT-4  (4),  it  is  sent  out  first.  When  X(l)  becomes  available  from  FFT-4 
(5),  it  is  sent  out  next.  The  sequence  is  repeated  until  all  16  values  are  sent  out. 

Components  similar  to  the  ones  used  in  the  decimator  are  used  in  the  expander  to 
route  the  control  signals.  The  control  signals  are  shown  in  Figure  3-10.  The  data  is 
routed  through  a  mux  controlled  by  the  expander  hardware. 

Figure 3-10. Expander Control Signals (REQOUT4-REQOUT7, ACKOUT4-ACKOUT7; REQOUT, ACKOUT)

3.2.5 "Crossbar"


The "crossbar" is responsible for handling the interconnections between the
input  FFT-4s  and  the  output  FFT-4s,  as  shown  in  Figure  3-5.  Output  values  from  the 
input  FFT-4s  are  handled  using  the  same  method  as  the  expander.  In  effect,  the  four  pairs 
of  REQOUT/ACKOUT  signals  are  reduced  to  a  single  pair  of  REQOUT/ACKOUT 
signals.  Then,  according  to  Figure  3-5,  the  first  four  available  values  are  sent  to  FFT-4 
(4).  The  next  four  are  sent  to  FFT-4  (5),  and  so  on. 

The  crossbar  is  implemented  with  an  expander  coupled  to  a  divide  by  four 
element.  The  data  is  passed  on  a  shared  bus  with  tristate  buffers  that  the  crossbar 
controls.  The  signals  are  shown  in  Figure  3-11. 
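
The reordering performed by the crossbar can be sketched in Python as an expander-style merge followed by a divide-by-four grouping. The interleaving order (one value from each input FFT-4 in turn) is inferred from the statement that the outputs are handled in the same way as in the expander; the function name and data layout are hypothetical, and the constant multiplications are left out.

```python
def crossbar(stage1_outputs):
    """Regroup first-stage results for the output FFT-4 blocks.

    stage1_outputs[k][m] is output m of input FFT-4 (k), k = 0..3.
    Returns a list whose entry i holds the four values sent to FFT-4 (4 + i).
    """
    # Expander behavior: take one value from each input FFT-4 in turn.
    stream = [stage1_outputs[k][m] for m in range(4) for k in range(4)]
    # Divide-by-four behavior: every four values go to the next output FFT-4.
    return [stream[i:i + 4] for i in range(0, 16, 4)]
```

Under these assumptions the crossbar acts as a 4 x 4 transpose: output FFT-4 (4 + i) receives output i from each of the four input FFT-4s.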


Figure 3-11. Crossbar Control Signals (REQOUT0-REQOUT3, ACKOUT0-ACKOUT3; REQIN4-REQIN7, ACKIN4-ACKIN7)


3.2.6  Constant  Banks 

There  are  three  constant  banks,  which  store  the  values  of  the  complex  constants 
for  multiplication  with  the  first  stage  FFT-4  results,  as  shown  in  Figure  3-5.  These 
constant  banks  are  created  with  combinational  logic  controlled  by  an  AFSM. 

3.2.7 Putting the FFT-16 Together

Now  that  each  component  has  been  described,  the  top-level  picture  can  be 
presented.  The  components  interconnected  by  the  control  and  data  paths  are  shown  in 
Figure  3-12. 


Figure  3-12.  FFT-16  Components 


3.3  Design  Conclusion 

This  chapter  outlined  the  hardware  required  for  an  FFT-16  using  the  FASST 
architecture.  The  FASST  architecture  produces  a  layout  that  is  highly  localized  and 
reuses  major  components,  as  seen  in  Figure  3-12.  This  lays  the  groundwork  for  the 
detailed  discussion  of  each  component  in  Chapter  Four. 

4. Design Implementation


This  chapter  presents  the  design  details  of  each  component  discussed  in  Chapter 
Three.  The  chapter  begins  by  looking  at  the  FFT-16  top  level  design  and  then  discusses 
the  top-level  operation.  Other  designs  considered  for  each  component  are  also  presented. 


4.1  FFT-16 

The  FFT-16  executes  the  FFT  on  an  input  data  stream.  The  top-level  block 
diagram  is  shown  in  Figure  4-1.  The  complex  values  x(0)  through  x(15),  are  fed  in  order 
on the DATA_INR (real component) and DATA_INI (imaginary component) input buses.
The transformed values X(0) through X(15) appear on the output buses DATA_OUTR
and DATA_OUTI in order as evaluated by Equation 2-8 for $N = 16$ and $N_1 = N_2 = 4$. Each
data  input  and  output  cycle  takes  place  with  a  four-cycle  handshake. 
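
For readers unfamiliar with the term, a four-cycle handshake can be modeled as the sequence of request and acknowledge levels seen on the link for each word. The Python sketch below assumes the conventional return-to-zero ordering (request up, acknowledge up, request down, acknowledge down); the function name and the point at which data is treated as valid are illustrative assumptions rather than details of this design.

```python
def four_cycle_transfer(words):
    """Yield (REQ, ACK, data) after each of the four phases, per word."""
    for word in words:
        yield (1, 0, word)   # 1. sender raises REQ with valid data
        yield (1, 1, word)   # 2. receiver latches the data and raises ACK
        yield (0, 1, None)   # 3. sender lowers REQ
        yield (0, 0, None)   # 4. receiver lowers ACK; the link is idle again
```

Each word therefore costs four signal transitions, and the next word cannot start until both lines have returned to zero.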


Figure 4-1. FFT-16 Top Level (signals: REQIN, ACKIN, REQOUT, ACKOUT, DATA_INR, DATA_INI, DATA_OUTR, DATA_OUTI)

The FFT-16 is composed of six major components, as shown in Figure 4-2. The
FFT-4,  complex  multiplier,  decimator,  expander,  crossbar  and  constant  banks  are 
discussed  in  detail  in  the  following  sections. 


Figure 4-2. FFT-16 Components


4.2  FFT-4 

Previous  research  efforts  have  explored  several  architectures  of  the  FFT-4.  The 
most recent effort used 16 registers and 2 ALUs (add/subtract units) [2]. The author's
recommendations for continuation suggested that there might be a simpler approach than
the one that was implemented. The design suggested was to use 16 dedicated add and
subtract units to accomplish the 16 add and subtract operations for the FFT-4 instead of
two ALUs.
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The  idea  of  simplification  was  taken  a  step  further  in  this  research.  A  design  was 
pursued  that  used  only  8  latches  to  latch  the  inputs  and  8  ALUs.  The  reduction  from  16 
to  8  adders  was  possible  by  realizing  that  Equations  3-16  through  3-23  can  be  reduced  to 
four  equations  by  simply  switching  the  operator  in  the  matching  equations.  For  example, 
Equation  3-16  is  identical  to  3-20  with  the  exception  of  the  operator.  The  control  for  the 
design  proved  to  be  more  complex  than  the  16  element  option,  but  it  produced  a  design 
that  had  several  advantages.  The  area  was  reduced  by  over  50%  and  the  large  fan-outs 
present  were  also  reduced  by  50%.  The  energy  efficiency  does  not  change  much  as  the 
same  number  of  switching  operations  takes  place  although  a  small  gain  is  realized  due  to 
the  decreased  circuit  fan-out.  Figure  4-3  illustrates  the  interconnection  of  the  eight  ALUs 
used  to  accomplish  the  FFT-4  calculation. 
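
The operator-switching idea can be illustrated with a minimal Python sketch. Equations 3-16 through 3-23 are not reproduced in this chapter, so the grouping below is an assumption based on the standard 4-point butterfly; the function names `fft4_add_sub` and `dft4` are hypothetical.

```python
import cmath

def fft4_add_sub(x):
    """4-point DFT of complex inputs x[0..3] using 16 real add/subtract operations."""
    (r0, i0), (r1, i1), (r2, i2), (r3, i3) = [(v.real, v.imag) for v in x]
    # First stage: each operand pair is both added and subtracted, so a single
    # add/subtract unit per pair can produce both results by switching its operator.
    ar, br = r0 + r2, r0 - r2
    ai, bi = i0 + i2, i0 - i2
    cr, dr = r1 + r3, r1 - r3
    ci, di = i1 + i3, i1 - i3
    # Second stage: multiplying by -j or +j only swaps real and imaginary parts,
    # so no multiplier is needed, only a different operand pairing.
    x0 = complex(ar + cr, ai + ci)
    x2 = complex(ar - cr, ai - ci)
    x1 = complex(br + di, bi - dr)
    x3 = complex(br - di, bi + dr)
    return [x0, x1, x2, x3]

def dft4(x):
    """Direct 4-point DFT used as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]
```

Each stage contains four add/subtract pairs that share the same operands, which is consistent with eight units each performing one add and one subtract.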


[Figure 4-3: eight ALUs with inputs Re{x(0)}, Re{x(2)}, Re{x(1)}, Re{x(3)}, Im{x(1)}, Im{x(3)}, Im{x(0)}, Im{x(2)} and outputs Re{X(0)}, Re{X(2)}, Re{X(1)}, Re{X(3)}, Im{X(1)}, Im{X(3)}, Im{X(0)}, Im{X(2)}.]

Figure 4-3. FFT-4 Block Diagram Using 8 ALUs


This  design  is  much  smaller  in  area  than  the  one  previously  reported  [2].  Table 
4-1  highlights  the  difference  in  area  between  the  previous  design  and  the  one  developed 
in this research effort. The area of the 2-ALU/16-Register design is extrapolated from
the original size of 6 bits to 16 bits for comparison.

Table 4-1. FFT-4 Area Comparison

Design                  Dimensions         Total Size
2 ALUs, 16 Registers    9242λ x 8925λ
8 ALUs, 8 Latches       6148λ x 5942λ

The  major  components  of  the  FFT-4  are  the  input  latches,  ALUs,  output 
multiplexor  (mux),  and  the  control  units  as  shown  in  Figure  4-4.  A  test  mux  is  included 
on  the  fabricated  FFT-4  to  incorporate  design  for  testability.  The  test  mux  is  not  included 
in  the  FFT-4  design  used  in  the  FFT-16. 


[Figure 4-4: block diagram with inputs Re{x(0)}, Re{x(2)}, Re{x(1)}, Re{x(3)}, Im{x(1)}, Im{x(3)}, Im{x(0)}, Im{x(2)}, each entering a LATCH16, and outputs Re{X(0)}, Re{X(2)}, Re{X(1)}, Re{X(3)}, Im{X(1)}, Im{X(3)}, Im{X(0)}, Im{X(2)}.]

Figure 4-4. FFT-4 Initial Design




The input latches are the only memory structures in the FFT-4. They latch the
input data as it appears on the input bus. The ALUs are 16-bit asynchronous
add/subtract units. Unlike a traditional ALU, which is clocked at the worst-case rate,
each ALU has completion detection circuitry that enables a better than worst-case flow
of data through it. The output mux is an array of simple 2x1 mux cells that route the
results of the four output ALUs to the output bus. The test mux is also an array of muxes
that route internal control and data signals to test ports.

4.2.1 Input Latches

The  input  latches  shown  in  Figure  4-4  are  composed  of  16  one-bit  latches  as 
shown  in  Figure  4-5. 


Figure  4-5.  One-bit  Latch  Cell 


The delay through the latches is nearly constant, so the asynchronous design
methodology best suited to them is the fundamental mode bounded delay model. A simple
delay element is used to signal LACK at a safe time after the LREQ signal is asserted.
HSPICE was used to determine the number of inverters needed to represent the delay
necessary for each latch. The HSPICE simulation is shown in Figure 4-6. For this element,
the simulation showed that it took a worst-case delay of 1.13 ns to latch the input data.
To model this delay, 14 inverters are needed, as each inverter has a delay of 0.08 ns. Two
additional inverters are used to provide a margin of safety.


[Figure 4-6: HSPICE waveform plot of voltage versus time, 4 ns to 32 ns.]

Figure 4-6. HSPICE Simulation of Register Latch Time


Figure 4-7. LATCH16 Schematic

Only one delay element is necessary and is included in the top level LATCH16
component. Inverters with a fanout of eight were used to correct for fanout of the LREQ
signal. The LATCH16 schematic is shown in Figure 4-7.

4.2.2  Asynchronous  ALU 

The  asynchronous  ripple-carry  ALU  has  an  execution  time  that  can  vary  widely 
from  a  minimum  to  a  maximum  case.  The  ripple-carry  design  was  chosen  due  to  its 
small  size  and  simplicity  when  compared  to  other  adder  schemes.  If  one  assumes  that  a 
random  distribution  of  data  is  processed  by  the  ALU,  the  average  computation  time  will 
lie  somewhere  between  the  minimum  and  maximum  time.  If  this  were  a  synchronous 
circuit,  the  computation  time  would  be  fixed  at  the  worst-case  time.  Therefore,  the  delay 
insensitive  design  methodology  was  applied  and  completion  detection  circuitry  was 
added. 

Implementing completion circuitry adds complexity and size and increases the
required power. The initial 1-bit ALU design used in the previous research was the
starting  point  for  this  design  [2].  It  is  based  on  a  dual-rail  asynchronous  adder  unit  design 
developed  at  the  University  of  Manchester  [15].  The  principle  of  operation  defines  that 
each  1-bit  stage  (shown  in  Figure  4-8)  of  the  adder  will  either  have  a  carry  out  or  no  carry 
out.  These  signals  are  designated  as  COUT  and  NOCOUT.  When  either  of  these  lines  is 
raised,  the  stage  can  be  considered  “done.”  The  ALU  reports  completion  with  an  ACK 
signal  when  all  stages  are  done. 
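
The benefit of per-stage completion signals can be sketched with a small behavioral model. This is an illustration under simplifying assumptions, not the circuit itself: every stage is given one unit of delay, and a stage whose two input bits agree is assumed to raise COUT or NOCOUT without waiting for its incoming carry. The function name `dual_rail_add` is hypothetical.

```python
def dual_rail_add(a, b, carry_in=0, width=16):
    """Add two unsigned `width`-bit values; return (sum, done_time).

    done_time is the number of stage delays until every stage has raised
    either COUT or NOCOUT, which is when ACK could be reported.
    """
    total, carry, ready, done_time = 0, carry_in, 0, 0
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        total |= (a_i ^ b_i ^ carry) << i
        if a_i == b_i:
            # Inputs agree: the stage raises COUT (1,1) or NOCOUT (0,0)
            # without waiting for the carry from the stage below.
            carry, ready = a_i, 1
        else:
            # Inputs differ: the stage must wait for the incoming carry.
            ready += 1
        done_time = max(done_time, ready)
    return total, done_time
```

In this model, adding 0x1234 to itself completes after one stage delay, while adding 0x0001 to 0xFFFF needs all 16, which is the spread between the minimum and maximum case described above.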

The ALU developed in the previous effort relied on a ripple-type reset of the
circuit after each calculation to clear the COUT and NOCOUT values of each stage
[2]. In this design, the COUT and NOCOUT signals are instead reset after each
calculation by a pair of NOR gates tied to the ALU16 REQ signal, which improves
the overall execution time of the ALU.


Three additional components are necessary to complete the design of the ALU.
To make the unit capable of subtracting, the exclusive OR function (XOR) is applied to
one of the input lines (B was arbitrarily chosen), with the ADD_SUB line being the other
XOR input. The XOR gate inverts the B value when ADD_SUB is asserted. Coupled with the
component described in the next paragraph, the ALU can execute the subtract function.

An  initialization  stage  was  necessary  to  set  the  first  stage  CIN  or  NOCIN  values 
according  to  the  ADD_SUB  and  REQ  values.  The  truth  table  is  shown  in  Table  4-2  and 
the  corresponding  circuit  is  shown  in  Figure  4-9. 

Table 4-2. ALU Initialization Truth Table

REQ    ADD_SUB    COUT    NOCOUT
 0        X         0        0
 1        0         0        1
 1        1         1        0

[Figure 4-9: gate-level circuit with input ADD_SUB and output NOCOUT visible.]

Figure 4-9. ALU Initialization Stage Circuit
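
The way the XOR gates and the initialization stage combine to give subtraction can be sketched as follows. The function names are hypothetical, and the model works on whole 16-bit words rather than on the dual-rail stages of the actual ALU.

```python
MASK16 = 0xFFFF

def init_stage(req, add_sub):
    """Initialization stage of Table 4-2: returns (COUT, NOCOUT)."""
    return req & add_sub, req & (1 - add_sub)

def alu16(a, b, add_sub):
    """16-bit add (add_sub=0) or subtract (add_sub=1) on two's complement words."""
    b_xor = b ^ (MASK16 if add_sub else 0)   # XOR with ADD_SUB inverts B for subtraction
    carry_in, _ = init_stage(1, add_sub)     # a carry-in of 1 completes the two's complement
    return (a + b_xor + carry_in) & MASK16
```

Inverting B and injecting a carry of one into the first stage computes A + (~B) + 1, which is A - B in two's complement arithmetic.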


The ALU16 consists of one ALU initialization stage along with 16 1-bit ALUs.
The NOCOUT and COUT of each stage are connected to the next stage's NOCIN and
CIN. A NOR-NAND tree is used to combine the done signals from each stage into the
ALU16ACK signal.


4.2.3  Output  Multiplexor 

To  control  which  output  of  the  second  stage  of  ALUs  is  seen  on  the  FFT-4  output, 
a  mux  is  used.  A  basic  4x1  mux  is  constructed  from  three  library  cell  2x1  muxes. 

Sixteen  of  these  muxes  are  combined  to  handle  all  the  output  values.  Figure  4-10  shows 
the  4x1  mux  design. 

Figure 4-10. 4x1 Mux
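
A behavioral sketch of the 4x1 mux built from three 2x1 cells is given below. The select-line naming (`sel1`, `sel0`) and the two-level tree arrangement are assumptions for illustration; the actual wiring is the one in Figure 4-10.

```python
def mux2(sel, d0, d1):
    """2x1 mux cell: passes d0 when sel is 0, d1 when sel is 1."""
    return d1 if sel else d0

def mux4(sel1, sel0, d0, d1, d2, d3):
    """4x1 mux built as a two-level tree of three 2x1 muxes."""
    low = mux2(sel0, d0, d1)
    high = mux2(sel0, d2, d3)
    return mux2(sel1, low, high)

def mux4_word(sel1, sel0, words, width=16):
    """Select one of four `width`-bit words using one 4x1 mux per bit."""
    out = 0
    for bit in range(width):
        bits = [(w >> bit) & 1 for w in words]
        out |= mux4(sel1, sel0, *bits) << bit
    return out
```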

Similar  to  the  input  latches,  a  delay  element  is  used  to  implement  the  ACK  signal 
for  the  output  multiplexor.  This  type  of  delay  scheme  is  appropriate  because  there  is  very 
little  variation  in  completion  time  when  switching  between  the  four  mux  states.  Although 
a  delay  element  is  used  in  the  fabricated  FFT-4  design,  later  investigation  found  that  a 
delay  element  on  a  mux  could  be  eliminated.  Changing  the  states  early  enough  in  the 
control  sequence  assures  that  the  proper  values  are  seen  on  the  output  when  the  REQOUT 
signal  is  given  for  the  FFT-4. 

4.2.4  FFT-4  Control  Units 

Development  of  the  AFSMs  necessary  for  the  FFT-4  was  an  iterative  process. 
Originally,  an  input  and  an  output  controller  were  designed  in  behavioral  VHDL.  They 
were  coupled  so  that  new  input  data  could  not  overwrite  output  data  not  yet  sent  out. 
These  controllers  turned  out  to  be  too  large  to  be  synthesized  by  3D.  The  controllers 
were  broken  up  using  Shannon  decomposition  until  they  could  be  synthesized  and 
simulated successfully at the structural VHDL level. The final design has nine small
controllers,  which  are  all  interconnected. 

4.2.5 Test Multiplexor

For  the  fabricated  FFT-4  design,  a  test  mux  was  added  to  the  design  to  allow 
probing  of  internal  signals  by  selecting  a  bank  of  test  signals  and  viewing  the  output 
through  bi-directional  pads.  Sixteen  4x1  multiplexors  were  used  to  allow  testing  of  64 
internal  signals. 

The  selection  of  the  test  signals  was  based  on  what  would  be  needed  to  isolate  a 
problem  if  the  test  chip  failed.  All  of  the  AFSM  control  signals  were  included  to  test  for 
any  control  path  problems.  Two  output  data  bits  and  their  complete  upstream  paths  were 
included  to  test  any  data  path  problems.  The  signals  that  can  be  observed  are  listed  in  the 
specification  sheet  of  the  FFT-4  chip  in  Appendix  B. 

4.2.6  Final  FFT-4  Design 

It  became  apparent  when  assembling  the  components  for  the  FFT-16  that  the 
design  of  the  FFT-4  had  to  be  changed.  Although  this  resulted  in  a  difference  between 
the  fabricated  chip  design  and  the  final  design  used  in  the  FFT-16,  the  test  chip  still 
effectively  serves  the  purpose  of  confirming  that  the  controllers,  ALUs  and  radiation 
tolerant  library  function  properly. 

The new FFT-4 design used in the FFT-16 has the same basic structure that was
presented in Figure 4-4. However, several modifications were made for better integration
into the FFT-16 design. The control sequence was modified to allow in-order input of
data. The output mux was split in half to make two 32x16 muxes, allowing the real and
imaginary results to be output at the same time. To handle the changes, four simpler
controllers replaced the original nine controllers. The AFSM descriptions of the
controllers are given in Appendix C. One of the controllers was reused while three new
controllers were developed. The only drawback to these modifications is that the output
order becomes X(0), X(2), X(1), X(3) to achieve minimum switching. Both the crossbar
and the reorder register handle this out-of-order sequence.

Another  enhancement  made  was  additional  fanout  corrections  throughout  the 
circuit.  The  final  FFT-4  design  is  almost  exactly  the  same  area  as  the  original  FFT-4 
shown in Figure 4-4. Figure 4-11 shows the layout of the final FFT-4 design.


Figure 4-11. Final FFT-4 Design

The overall throughput of the final FFT-4 design is 75% faster than that of the first
design in this effort because the real and imaginary components are available at the
same time. The area of the final design is slightly smaller due to the reduced number of
controllers and muxes.

4.3  Complex  Multiplier 

The complex multiplier used in the previous research was based on a
design developed at AFIT [2] [34]. That design is based on the Booth multiplication
algorithm, which is suitable for asynchronous implementation [35].

The  design  pursued  in  this  effort  employs  a  radix-4  Booth  encoded  scheme  in  the 
real  multipliers  and  has  no  component  reuse.  This  design  is  twice  as  fast  as  the  radix-2 
implementation  with  a  trivial  impact  to  area,  due  to  a  slightly  more  complex  controller. 
The  complex  multiplier  also  has  a  provision  for  directly  forwarding  the  input  to  the 
output  for  a  fast  multiply  by  one,  which  saves  power  and  gives  a  higher  throughput.  This 
was  implemented  because  one  of  the  four  constants  in  each  constant  bank  (described  in 
Section  4.4)  is  one.  The  complex  multiplier  is  composed  of  four  16-bit  real  multipliers, 
one  subtract  unit,  one  add  unit  and  a  multiply  by  one  mux  to  accomplish  the  complex 
multiply  operation  in  Equation  3-27.  Figure  4-12  shows  the  components  of  the  complex 
multiplier. 
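
The arrangement of four real multipliers, one subtract unit, one add unit and the bypass mux can be sketched as below. Equation 3-27 is not reproduced in this chapter, so the standard complex product is assumed; the function name and the `mul_by_one` flag are hypothetical stand-ins for the hardware signals.

```python
def complex_multiply(x, w, mul_by_one=False):
    """Multiply x = (xr, xi) by w = (wr, wi) using four real multiplies.

    When mul_by_one is set, the input is forwarded unchanged, mirroring the
    bypass mux used when the constant is one.
    """
    if mul_by_one:
        return x
    xr, xi = x
    wr, wi = w
    real = xr * wr - xi * wi   # two multipliers feed the subtract unit
    imag = xr * wi + xi * wr   # two multipliers feed the add unit
    return real, imag
```

Skipping the four multiplies in the bypass case is what saves power and time for the constant of one.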


The radix-4 Booth encoding was chosen over the baseline radix-2 encoding
because  it  reduces  the  maximum  possible  ALU  operations  from  16  to  8  for  16-bit  data. 
Higher  order  radix  algorithms  are  possible,  but  make  the  design  more  complex  [35]. 

Multiplying  two  16-bit  words  together  produces  a  32-bit  result.  Realizing  that 
this  complicates  the  design  by  doubling  the  data  width  of  components  downstream  from 
the  multiplier,  the  output  of  the  multiplier  was  normalized.  Normalization  is  achieved  by 
selecting  the  16  output  signals  that  represent  the  original  data  format  shown  in  Figure 
3-3. Overflow is not a concern because the inputs in an FFT are always fractional, so
each product is smaller in magnitude than its inputs.
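
A minimal sketch of the normalization step follows. Figure 3-3 is not reproduced here, so the data format is an assumption inferred from the constant tables, where the word 1000 (hex) represents 1.0, implying 12 fractional bits. The function names are hypothetical.

```python
FRAC_BITS = 12  # assumed: 0x1000 represents 1.0

def to_signed16(word):
    """Interpret a 16-bit word as a two's complement integer."""
    return word - 0x10000 if word & 0x8000 else word

def normalized_multiply(a_word, b_word):
    """Multiply two 16-bit fixed-point words and keep 16 bits in the input format."""
    product = to_signed16(a_word) * to_signed16(b_word)   # up to 32 bits, 24 fractional
    return (product >> FRAC_BITS) & 0xFFFF                # drop the 12 extra fractional bits
```

Under this assumption, normalization amounts to selecting bits 27 through 12 of the 32-bit product.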

The main components of the complex multiplier are the 2's complement radix-4
Booth  multiply  units,  the  add  unit  and  the  subtract  unit.  These  three  components  are 
described  in  detail  in  the  next  sections. 

4.3.1 Multiply by One Mux

A set of muxes on the output of the complex multiplier allows the direct
forwarding of the Xin value to the output under a multiply by one condition. The
constant bank sets this condition, which occurs once every fourth multiply.

4.3.2 Add and Subtract ALUs

Two dedicated ALUs were modified for use with the complex multiplier. They
are based on the ALU16 design in the FFT-4. Because each of these ALUs performs only
an add or only a subtract, the unneeded gates were removed to speed up its operation
and reduce its size.

4.3.3 Radix-4 Booth Encoded Fixed-Point Multiplier

The  heart  of  the  complex  multiplier  is  an  arrangement  of  four  fixed-point  multiply 
units  which  multiply  two  2’s  complement  16-bit  numbers  and  produce  a  single  16-bit 
result.  The  radix-4  Booth  encoded  multiplier  is  composed  of  six  basic  units.  There  is  a 
34-bit  shift  register,  a  modified  ALU,  a  modified  latch,  a  2x  multiply  unit,  a  Booth 
decoder  and  three  controllers.  These  components  are  connected  as  shown  in  Figure  4-13. 

Figure 4-13. Booth Multiplier Block Diagram

4.3.3.1 34-bit Shift Register

The 34-bit shift register is composed of 20 resettable and 14 non-resettable D
flip-flops. Additional logic was added to allow the reset or loading of the top 17 bits and
loading of the lowest 17 bits. A shift signal causes the register to do an arithmetic shift
right by two (with MSB sign extension). The resettable D flip-flops are used to initialize
the multiplier to a known state upon reset. They also zero the value of the top 17 bits for
each new multiply. Figure 4-14 shows the interconnection of a few D flip-flops, as the
entire schematic is too large to show here.

Figure 4-14. 34-Bit Shift Register Block Diagram


4.3.3.2 Booth Decoder

A Booth decoder was designed to implement the radix-4 Booth algorithm. It is a
logic block that asserts the 2X, ADDSUB and SHIFTONLY signals by looking at the three
least significant bits of the 34-bit shift register. The truth table for the radix-4 algorithm
is shown in Table 4-3 and the logic is shown in Figure 4-15 [35].


Table 4-3. Radix-4 Booth Algorithm

Bit 2   Bit 1   Bit 0   Operation
  0       0       0     Shift only
  0       0       1     Add 1 x multiplicand, shift
  0       1       0     Add 1 x multiplicand, shift
  0       1       1     Add 2 x multiplicand, shift
  1       0       0     Subtract 2 x multiplicand, shift
  1       0       1     Subtract 1 x multiplicand, shift
  1       1       0     Subtract 1 x multiplicand, shift
  1       1       1     Shift only

Figure 4-15. Radix-4 Booth Decoder


The MUL2X signal controls the operation of the 2X multiply. The ADDSUB
signal, which goes to the ALU, is simply Bit 2. The SHIFT signal serves as a masking
signal to the controllers in a shift-only condition (000 or 111) to simplify the controller
design. It was found in the design process that without masking the control signals in this
case, the controllers would have to handle this exception, which increases the size of the
controllers.
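
The radix-4 Booth procedure of Table 4-3 can be sketched in Python as follows. This is a behavioral illustration, not the gate-level design: the register is modeled with an unbounded Python integer, so the 17-bit width of the upper half is not enforced, and the function name `booth_radix4` is hypothetical.

```python
# Booth digit for each value of the three least significant register bits (Table 4-3).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4(multiplicand, multiplier, width=16):
    """Multiply two signed `width`-bit integers; return (product, alu_ops)."""
    upper = width + 1                      # 17 for 16-bit operands
    # Lower half: the multiplier bits plus one appended zero bit on the right.
    reg = (multiplier & ((1 << width) - 1)) << 1
    alu_ops = 0
    for _ in range(width // 2):            # eight iterations for 16-bit data
        digit = BOOTH_DIGIT[reg & 0b111]
        if digit:                          # a digit of 0 is the shift-only case
            reg += (digit * multiplicand) << upper
            alu_ops += 1
        reg >>= 2                          # Python's >> is an arithmetic shift
    return reg >> 1, alu_ops               # discard the appended bit position
```

Because each iteration consumes two multiplier bits, at most eight ALU operations are needed for 16-bit data, and shift-only digits reduce the count further.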


4.3.3.3 Modified ALU

The basic asynchronous ALU16 used in the FFT-4 was modified for use in the
multiplier. An additional stage was added to make the adder a 17-bit adder (ALU17).
This additional bit was needed to handle the sign extension that occurs in Booth
multiplication.


4.3.3.4 Modified Latch

Similar to the modified ALU in the previous section, the LATCH16 used in the
FFT-4 was extended by one bit to make a LATCH17 (stores 17 bits). This was necessary
to latch the data from the upper 17 bits of the shift register.

4.3.3.5 2X Multiply Unit


This unit executes a multiply by two when selected by the Booth decoder.
The multiply by two is done through 17 muxes with a wired sign extension.

4.3.3.6 Multiply Control

The  multiply  control  is  divided  into  three  simple  control  blocks.  The  top 
controller  interfaces  with  the  external  MULTREQ  and  MULTACK  signals.  From  these 
external  signals,  the  two  remaining  controllers  are  driven  to  complete  the  Booth 
multiplication.  In  the  case  of  the  “shift  only”  condition,  the  ALU  operation  of  the 
controller  is  masked  by  minimum  delay  elements  to  allow  for  a  quick  shift  by  two.  The 
AFSM  state  tables  are  shown  in  Appendix  C. 


4.4  Constant  Banks 

There are three constant banks that store the values of the W16 matrix presented in
Equation 3-2. Each bank produces four constants. Due to the symmetry of FFTs, many
of the constants are the same as, or the negative of, other values. The values in the
constant banks are listed in Tables 4-4, 4-5, and 4-6. The decimal values shown are the
closest representation of the irrational values given 12-bit precision.
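
The stored words can be reproduced with a short sketch, assuming 12 fractional bits (so that the word 1000 in hex represents 1.0) and round-to-nearest quantization. Both the rounding mode and the function names are assumptions; truncation gives the same words for these particular constants.

```python
import math

FRAC_BITS = 12  # assumed: 0x1000 represents 1.0

def to_word(value):
    """Quantize a real value to a 16-bit two's complement word with 12 fractional bits."""
    return round(value * (1 << FRAC_BITS)) & 0xFFFF

def twiddle(k, n=16):
    """Return the (real, imaginary) words of W_n^k = exp(-j*2*pi*k/n)."""
    angle = 2 * math.pi * k / n
    return to_word(math.cos(angle)), to_word(-math.sin(angle))

for k in range(4):
    re, im = twiddle(k)
    print(f"W16^{k}: {re:04X} {im:04X}")
```

For example, cos(pi/4) scaled by 4096 is about 2896.3, which quantizes to 2896, or 0B50 in hex, the value 0.70703125 that appears in the tables.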

The  constant  banks  were  implemented  with  combinational  logic  controlled  by  the 
two-bit  Johnson  counter  shown  in  Figure  2-6.  The  COUNT  signal  is  fed  by  the 
ACKOUT  signal  of  the  FFT-4  that  supplies  the  complex  multiplier  data.  When  the  first 
number  to  be  multiplied  comes  through,  the  counter  has  a  value  of  00.  The  counter  value 
determines  what  constant  will  be  fed  into  the  complex  multiplier.  Figure  4-16  shows  the 
logic for determining the constants fed into CPLXMULT1, CPLXMULT2, and
CPLXMULT3, respectively. The output values represented as A through K and
MULBY1 are wired to the inputs of the complex multiplier to form the appropriate
constant values. Logic was not necessary to represent all 16 combinations, as 10
combinations were enough to express the constants.


Table 4-4. Constant Bank One Values

            Binary Equivalent (hex)    Decimal Equivalent
Constant    Real    Imaginary          Real               Imaginary
1           1000    0000                1.000000000000     0.000000000000
W16^2       0B50    F4B0                0.707031250000    -0.707031250000
W16^1       0EC8    F9E1                0.923828125000    -0.382568359375
W16^3       061F    F138                0.382568359375    -0.923828125000

Table 4-5. Constant Bank Two Values

            Binary Equivalent (hex)    Decimal Equivalent
Constant    Real    Imaginary          Real               Imaginary
1           1000    0000                1.000000000000     0.000000000000
-j          0000    F000                0.000000000000    -1.000000000000
W16^2       0B50    F4B0                0.707031250000    -0.707031250000
W16^6       F4B0    F4B0               -0.707031250000    -0.707031250000

Table 4-6. Constant Bank Three Values

            Binary Equivalent (hex)    Decimal Equivalent
Constant    Real    Imaginary          Real               Imaginary
1           1000    0000                1.000000000000     0.000000000000
W16^6       F4B0    F4B0               -0.707031250000    -0.707031250000
W16^3       061F    F138                0.382568359375    -0.923828125000
-W16^1      F138    061F               -0.923828125000     0.382568359375

Figure 4-16. Constant Banks


4.5  Decimator 

The decimator is a controller that interfaces the external FFT-16
REQIN/ACKIN signals to the input request signals of the first stage FFT-4s. Initially,
an AFSM synthesized by 3D handled the control signals. The AFSM was discarded in favor
of a simpler design that uses a two-bit Johnson counter, as shown in Figure 4-17. The
COUNT signal of the counter is fed by the ACKIN signal, so the count increments once
for every completed four-cycle handshake. The count controls the MUX4X1
and selector circuit, which passes the FFT-16 REQIN/ACKIN signals to
the appropriate FFT-4. Data is passed to the input stage FFT-4s through a shared bus
corrected for fanout.

Figure 4-17. Decimator Gate-Level Schematic
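
The counter-driven selection can be sketched behaviorally. A two-bit Johnson counter steps through four states and changes only one bit per count. The assignment of counter states to FFT-4 indices and the round-robin distribution of words are assumptions for illustration; the function names are hypothetical.

```python
def johnson_next(state):
    """Advance a two-bit Johnson (twisted-ring) counter by one count."""
    q1, q0 = state
    return (q0, 1 - q1)   # shift, feeding the inverted last bit back in

def decimate(words, start=(0, 0)):
    """Distribute a stream of input words over four FFT-4s, one word per handshake."""
    select = {}           # counter state -> FFT-4 index, in visiting order
    state = start
    for index in range(4):
        select[state] = index
        state = johnson_next(state)
    buffers = [[] for _ in range(4)]
    state = start
    for word in words:
        buffers[select[state]].append(word)   # handshake routed to the selected FFT-4
        state = johnson_next(state)           # handshake complete: COUNT advances
    return buffers
```

With 16 input words numbered 0 through 15, the first FFT-4 in this sketch receives words 0, 4, 8 and 12.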


4.6  Expander 

Similar  to  the  decimator,  the  expander  interfaces  the  output  stage  FFT-4s  output 
request signals to the FFT-16 output request signals. The COUNT signal of the two-bit
Johnson  counter  is  fed  by  the  FFT-16  ACKOUT  signal.  The  counter  bits  control  a 
MUX4X1  and  selector  circuit  which  connects  the  correct  output  stage 
REQOUT/ACKOUT  pair  to  the  FFT-16  REQOUT/ACKOUT  external  signals.  The  SEL 
bits  also  control  an  output  mux  which  routes  the  data  from  the  appropriate  output  stage 
FFT-4  to  the  output.  The  expander  schematic  is  shown  in  Figure  4-18. 

Figure 4-18. Expander Gate-Level Schematic


4.7  “Crossbar” 

The  “crossbar”  handles  the  interconnection  between  the  input  stage  output  request 
signals  and  the  output  stage  input  request  signals.  The  crossbar  collects  one  data  word 
from  each  input  stage  FFT-4  (0  through  3)  and  sends  them  in  order  to  the  first  output 
stage  FFT-4  (4).  The  cycle  is  repeated  for  the  remaining  output  stage  FFT-4s  (5  through 
7). 

The first component designed for the crossbar was a divide by four circuit, as
shown in Figure 4-19. This divider was used to handle the difference in sampling
frequency per FFT-4 between the input stage and the output stage. The divider signaled
the change in routing signals to the output stage FFT-4s once for every four input stage
FFT-4 output requests.


Figure  4-19.  Divide  By  Four  Schematic 


A  control  circuit  similar  to  an  expander  control  circuit  is  used  to  funnel  down  the 
input  stage  control  signals  to  a  single  REQOUT/ACKOUT  pair.  This  expander  also 
drives  four  write  enable  lines  that  are  connected  to  tri-state  buffers  that  control  which 
FFT-4  is  allowed  to  write  its  post-multiplied  data  to  a  shared  bus.  The  output  of  the 
divide  by  four  block,  which  is  driven  by  the  ACKOUT  signal  of  the  re-used  expander,  is 
fed  into  another  two-bit  Johnson  counter  which  facilitates  the  four  value  input  into  each 
output  stage  FFT-4. 

Because  the  outputs  of  the  FFT-4  appear  out  of  order,  the  crossbar  is  used  to 
correct  the  order  of  the  data  by  switching  the  FFT-4  (5)  and  FFT-4  (6)  request  lines.  This 
eliminates  the  need  to  correct  the  output  sequence  from  the  input  stage  FFT-4s.  The  next 
section  covers  the  reorder  register,  which  is  used  to  correct  the  output  sequence  of  the 
output  stage  FFT-4s.  The  crossbar  schematic  is  shown  in  Figure  4-20. 

Figure 4-20. Crossbar Schematic
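
The routing performed by the crossbar can be sketched as a small data-movement model. This is an interpretation of the description above, not the control circuit itself: the twiddle multiplication is omitted, and the names `FFT4_OUTPUT_ORDER` and `crossbar` are hypothetical.

```python
FFT4_OUTPUT_ORDER = [0, 2, 1, 3]   # frequency index of each successive FFT-4 output word

def crossbar(input_stage_outputs):
    """Route input-stage words to the output-stage FFT-4s (numbered 4 through 7).

    input_stage_outputs[i] lists the four words of input-stage FFT-4 i in the
    order that FFT-4 emits them. Returns {output FFT-4 number: [four words]}.
    """
    routed = {}
    for cycle, freq_index in enumerate(FFT4_OUTPUT_ORDER):
        # One word is collected from each input-stage FFT-4 (0 through 3), in order.
        words = [outputs[cycle] for outputs in input_stage_outputs]
        # Words of frequency index k belong to output FFT-4 number 4 + k, so the
        # second and third cycles target FFT-4 (6) and FFT-4 (5), in that order.
        routed[4 + freq_index] = words
    return routed
```

Swapping the destinations of the second and third cycles is the software analogue of switching the FFT-4 (5) and FFT-4 (6) request lines.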


4.8 Reorder Register

The reorder register is a necessary component to correctly order the output of the
FFT-4s on the output stage. As mentioned previously, the crossbar handles the output
order sequence problem of the input stage FFT-4s. The reorder register is simply a pair
of registers and muxes coupled to a small controller. The controller allows the first data
word, X(0), to pass straight through. The second REQIN latches the input data, X(2), but
does not give a REQOUT signal. The third REQIN, along with the input data X(1), is
passed straight through as in the first case. The fourth REQIN triggers the mux to output
the value in the register, which is X(2). The controller then requests that the data on the
input bus, X(3), be passed through. This way the final output sequence of the FFT-4 is
changed to X(0), X(1), X(2), X(3). The component level schematic is shown in Figure
4-21.


Figure 4-21. Reorder Register Schematic
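
The controller sequence described above can be sketched as a small behavioral model. The function name `reorder` is hypothetical, and a single Python variable stands in for the register.

```python
def reorder(words):
    """Reorder FFT-4 output words arriving as X(0), X(2), X(1), X(3)."""
    output = []
    register = None
    for count, word in enumerate(words):
        phase = count % 4
        if phase in (0, 2):
            output.append(word)        # first and third words pass straight through
        elif phase == 1:
            register = word            # second word is latched, no REQOUT yet
        else:
            output.append(register)    # fourth REQIN: output the latched word first,
            output.append(word)        # then pass the word on the input bus
    return output
```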


4.9  Design  Implementation  Conclusion 

Each  major  component  was  described  in  detail  in  this  chapter,  along  with  other 
designs  considered  for  each  component.  The  FFT-4,  complex  multiplier,  decimator,  and 
expander  all  work  together  to  achieve  the  top-level  FFT-16  functionality  as  shown  in 
Figure  4-2.  A  reorder  register  was  added  to  correct  the  output  order  of  the  FFT-4.  The 
next  chapter  analyzes  the  components  at  all  levels  of  simulation. 

5. Results


5.1  FFT-4  Test  Chip 

The  FFT-4  test  chip  did  not  return  from  fabrication  in  time  for  the  test  results  to 
be  included  in  this  thesis.  The  purpose  of  the  test  chip  was  to  validate  that  the 
asynchronous  design  methodologies  chosen  and  the  radiation  tolerant  library  performed 
as  expected.  The  FFT-4  test  chip  will  be  evaluated  at  a  later  date  to  determine  if  any 
corrections  need  to  be  made  to  the  design  of  the  FFT-16  before  it  is  fabricated. 


5.2  Simulation  Results 

This  section  presents  the  simulation  results  of  each  major  component  and  the  top- 
level  FFT-16  design.  VHDL  simulation  was  used  to  verify  proper  operation  and  derive 
rough  estimates  of  component  timing  information.  Back-annotated  timing  data  obtained 
from  HSPICE  simulations  of  each  logic  gate  was  used  to  describe  gate  delays  to  the 
VHDL  simulator.  IRSIM  was  used  to  simulate  the  performance  of  the  layout  and  verify 
proper  operation  [13].  IRSIM  is  a  mixed-mode  simulator,  which  analyzes  circuit 
extraction  data  and  gives  a  good  functionality  check  as  well  as  more  realistic  timing 
information. Finally, HSPICE simulation gives the most accurate results possible,
including power information, which is not available from either VHDL or IRSIM.

High efficiency and performance were the end goals for the FFT-16. However,
correct circuit operation and, more importantly, correct results were an implied
requirement. To demonstrate that the FFT-16 correctly transforms a time
domain sequence into a frequency domain sequence, multiple test cases were used for
validation. One simple test case is the input of an impulse function. Mathcad [36] was
used to calculate the expected output stream. The same impulse function was given to
the FFT-16 and simulated in VHDL and IRSIM. Both VHDL and IRSIM produced the
same results, as shown in Figures D-27 and D-28. Table 5-1 shows the input
time-domain sequence x(n). Table 5-2, Figure 5-1 and Figure 5-2 compare the simulated
output frequency-domain sequence X(m) with the expected values. This test sequence
demonstrates that the FFT-16 produces the correct X(m) for a given x(n), with a small
error due to the selected fixed-point data format.
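
The size of the fixed-point error in this test case can be estimated with a short sketch. It assumes that the only error source is the quantization of the constants to 12 fractional bits, and that the percent error is taken on the real component relative to the reference; it does not model the FFT-16 circuit. The direct DFT plays the role that Mathcad plays in the text, and all names are hypothetical.

```python
import cmath

def dft(x):
    """Direct N-point DFT used as the reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def quantize(value, frac_bits=12):
    """Round a real value to the nearest multiple of 2**-frac_bits."""
    return round(value * (1 << frac_bits)) / (1 << frac_bits)

x = [0.0] * 16
x[1] = 1.0                                   # impulse at n = 1
reference = dft(x)
quantized = [complex(quantize(v.real), quantize(v.imag)) for v in reference]

def percent_error(m):
    """Error of the real component relative to the reference, in percent."""
    return 100 * (quantized[m].real - reference[m].real) / abs(reference[m].real)
```

For an impulse at n = 1 the output is X(m) = W16^m, so each output word is simply one of the stored constants, which is why the errors are those of the constant quantization.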

Table 5-1. Impulse Function Input Sequence

n               x(n)
0               0
1               1
2 through 15    0

Table 5-2. Mathcad vs. Simulation Results

      Mathcad Results                     Simulation Results                  Maximum
m     Re{X(m)}          Im{X(m)}          Re{X(m)}          Im{X(m)}          Error (%)
0      1.000000000000    0.000000000000    1.000000000000    0.000000000000    0.00000
1      0.923879532511   -0.382683432365    0.923828125000   -0.382568359375   -0.00556
2      0.707106781187   -0.707106781187    0.707031250000   -0.707031250000   -0.01068
3      0.382683432365   -0.923879532511    0.382568359375   -0.923828125000   -0.00556
4      0.000000000000   -1.000000000000    0.000000000000   -1.000000000000    0.00000
5     -0.382683432365   -0.923879532511   -0.382568359375   -0.923828125000
6     -0.707106781187   -0.707106781187   -0.707031250000   -0.707031250000
7     -0.923879532511   -0.382683432365   -0.923828125000   -0.382568359375
8     -1.000000000000    0.000000000000   -1.000000000000    0.000000000000
9     -0.923879532511    0.382683432365   -0.923828125000    0.382568359375
10    -0.707106781187    0.707106781187   -0.707031250000    0.707031250000
11    -0.382683432365    0.923879532511   -0.382568359375    0.923828125000   -0.00556
12     0.000000000000    1.000000000000    0.000000000000    1.000000000000
13     0.382683432365    0.923879532511    0.382568359375    0.923828125000
14     0.707106781187    0.707106781187    0.707031250000    0.707031250000
15     0.923879532511    0.382683432365    0.923828125000    0.382568359375

[Figure 5-1: plot of X(m) versus m, theoretical and simulation data.]

Figure 5-1. Theoretical vs. Simulation Data (Real Component)

[Figure 5-2: plot of X(m) versus m, theoretical and simulation data.]

Figure 5-2. Theoretical vs. Simulation Data (Imaginary Component)

Table 5-3 gives a summary of the FFT-16 design statistics and the performance
obtained  through  simulation.  Component  latencies  are  given  for  the  worst-case 
propagation  time  through  the  component.  The  throughput  is  given  for  the  FFT-4  and 
FFT-16.  The  energy  measurement  for  each  component  is  the  energy  required  for  that 
component  in  a  single  FFT-16  calculation.  The  computational  resources  were  not 
available  to  run  a  top-level  FFT-16  HSPICE  simulation,  so  the  FFT-16  energy 
requirement  was  determined  by  the  sum  of  the  energy  required  for  each  component 
multiplied  by  the  number  of  instances.  This  energy  measurement  gives  an  efficiency  of 
28  nJ/Unit  Transform  for  the  FFT-16.  Appendix  D  contains  the  simulation  waveforms 
used  to  obtain  the  data  in  Table  5-3. 

Table  5-3.  FFT-16  Design  and  Simulation  Results 


Design  Statistics 

Component  Latency 
(ns) 

Throughput 
(ns) 

Component 

Area  (µm²) 

Transistors 

VHDL 

IRSIM 

HSPICE 

IRSIM 

HSPICE 

Decimator 

41,000 

271 

1.0 

1.0 

1.0 

- 

0.26 

1,143 

1.0 

1.0 

2.0 

- 

■eeh 

271 

1.0 

1.0 

1.0 

- 

W>MJ1 

Reorder  Reg. 

13 

16 

9.0 

- 

FFT-4 

80 

180 

17 

mmm 

WBSBSi 

85 

110 

- 

98 

FFT-16 

45,000,000 

271,908 

585 

980 

- 

760 

440 

The area and transistor count are also given for each component and the top-level
circuit. Because a two-metal channel router was used to route the circuits, the area
can be reduced by 50% with a high-performance three-metal over-the-cell router. A
simulation of the re-routed circuit will also show a slight decrease in the energy
consumption of the circuit, because the optimized routing lowers the trace capacitance
throughout the circuit.

The worst-case throughput time of 760 ns and efficiency of 28 nJ/Unit Transform
of the FFT-16 must be extrapolated to an FFT-1024 for comparison with the FFT
processors presented in Chapter Two. The throughput time of the FFT-1024 is derived
by dividing the latency of the slowest block in the FFT-1024 (the complex multiplier)
by the decimation at the top level (64). This gives a throughput time of 2 µs. The
efficiency is calculated by summing the energy of the components required to calculate
a single point of an FFT-1024. This gives an extrapolated FFT-1024 efficiency of
120 nJ/Unit Transform. Table 2-2 is presented again here, with the new FASST
extrapolated data, as Table 5-4.

Table  5-4  shows  that  the  design  produced  in  this  thesis  effort  has  roughly  the 
same  performance  when  compared  to  the  previous  thesis  effort.  However,  the  results  of 
this  research  should  be  considered  more  reliable,  as  the  results  of  the  previous  thesis  were 
extrapolated  from  a  6-bit  design. 

In  conclusion,  the  FASST  architecture  combined  with  asynchronous  design  and  a 
radiation  tolerant  library  indeed  produced  a  design  that  is  acceptable  for  space,  as  it  offers 
two  orders  of  magnitude  improvement  over  typical  space-based  FFT  processors  in  both 
throughput  time  and  efficiency. 
Table 5-4. Final Comparison of FFT Processors

Processor         Design Feature                   Dataword Size          Supply       FFT-1024 Throughput  Efficiency
Name                                               & Type                 Voltage (V)  Time (µs)            (nJ/Unit Transform)
C40               Space                            32-bit Floating Point  5.0          1298                 5704
Spiffee           Low Power                        20-bit Fixed Point     3.3          30                   24.7
COBRA             Speed                            23-bit Fixed Point     5.0          9.5                  71.4
FASST (previous)                                   16-bit Fixed Point     3.3          10                   120
FASST (new)       Asynchronous, space & low power  16-bit Fixed Point     3.3          2                    120

6.  Summary  and  Conclusions 


6.1  Summary 

This  thesis  has  presented  the  steps  taken  to  develop  an  energy-efficient  high- 
performance  asynchronous  FFT-16  designed  for  space.  Lessons  learned  from  previous 
research  were  used  as  a  starting  point.  New  concepts  and  designs  were  developed. 
Finally,  all  of  the  components  were  integrated  to  achieve  the  top-level  FFT-16  design  and 
simulations  were  used  to  validate  correct  operation  and  to  evaluate  performance. 


6.2  Conclusions 

The results presented in Chapter Five clearly demonstrate that an asynchronous
implementation of the FASST architecture can potentially facilitate the design of
large-point FFT processors suitable for the space environment. The estimated
throughput time of 2 µs and efficiency of 120 nJ/Unit-Transform for an FFT-1024 offer
an improvement of two orders of magnitude over existing space-based FFT processors.
Moreover, this design is highly competitive with similar terrestrial designs. These
significant efficiency and performance improvements are justification enough to
continue this area of research and to build FFT processors with larger point sizes.
6.3  Lessons  Learned 


Throughout  the  course  of  the  design  process,  many  lessons  were  learned. 
Specifically,  some  valuable  lessons  were  learned  by  using  the  MRC  library  and  the 
MOSIS  [37]  service. 

The  use  of  the  MRC  library  presented  some  interesting  design  challenges.  The 
only  available  component  of  the  library  at  the  time  of  design  was  the  physical  layout  of 
the  cells.  HSPICE  simulation  data  of  the  cells  and  generic  symbols  were  used  to  build  a 
VHDL  library  for  structural  simulation  with  timing  information.  Fanout  had  to  be 
manually  compensated  for  throughout  the  design,  as  no  library  file  was  available  for  the 
Synopsys  Design  Analyzer,  although  one  could  have  been  developed.  An  interface  also 
had  to  be  built  to  the  Lager  Octtools  layout  tool  for  the  place  and  route  phase  of  the 
design. 

Future  design  efforts  with  the  radiation  tolerant  cells  will  benefit  from  the  library 
files  produced  by  MRC  when  they  are  available.  The  VLSI  lab  will  also  greatly  benefit 
from  using  an  advanced  cell  router  to  take  advantage  of  the  ability  of  the  radiation 
tolerant  cells  to  be  densely  packed. 

An interesting lesson learned from using the MOSIS service was the difference
between technology layout rules and process-specific fabrication rules [37]. Upon
first submittal to the MOSIS service, the test chip design was rejected due to
“oversized features on the contact layer.” Discussing the problem with the MOSIS
representative revealed that the HP 0.5 µm process requires any feature on the
contact layer (meaning poly contacts or metal vias) to be exactly the minimum size,
neither larger nor smaller. The design rule check (DRC) feature of MAGIC [13] only
checks whether a feature is too small or too closely spaced, not whether it is too
large. This is because some fabrication processes automatically break oversized
contact features into minimum-sized contacts. Correcting the oversized contacts
resulted in a successful submission.


6.4  Recommendations  for  Future  Research 

Clearly,  the  next  step  for  research  in  this  area  is  to  implement  larger  point  sizes 
with  the  FASST  architecture.  There  is  also  a  need  to  improve  the  asynchronous  design 
flow. 

The next logical point size to be developed is the FFT-256. With the FFT-16
building block developed in this research, a FASST FFT-256 can be developed with
N1 = N2 = 16. The complex multiplier design should be resized to meet the performance
criteria of the FFT-256. New decimator, crossbar and expander elements would also
have to be developed. The physical area of the FFT-256 will be much larger than that
of the FFT-16 and will require a high-performance router to achieve the greatest
possible cell density and to reduce the long metal trace issues present in larger
designs.
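
The way an FFT-256 can be assembled from two stages of 16-point transforms can be illustrated with the generic two-factor (Cooley-Tukey) decomposition N = N1 x N2. The sketch below is a textbook index mapping written for illustration; it is not claimed to reproduce the exact FASST data flow, and the function names are hypothetical. The stages are labelled with the FASST element names only by loose analogy.

```python
import cmath
import math


def dft(x):
    """Direct DFT, used both as the small building block and as the reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def two_factor_dft(x, n1, n2):
    """Compute an (n1*n2)-point DFT from n1-point and n2-point DFTs."""
    n_total = n1 * n2
    if len(x) != n_total:
        raise ValueError("input length must equal n1 * n2")
    # Stage 1 (decimation): sequence b holds x[b], x[b + n2], x[b + 2*n2], ...
    # Each decimated sequence is transformed with an n1-point DFT.
    cols = [dft(x[b::n2]) for b in range(n2)]
    # Stage 2 (complex multiplication): apply the twiddle factor W_N^(b*k1).
    for b in range(n2):
        for k1 in range(n1):
            cols[b][k1] *= cmath.exp(-2j * math.pi * b * k1 / n_total)
    # Stage 3: an n2-point DFT across the sequences for each k1, then reorder
    # so that output index k = k1 + n1*k2.
    out = [0j] * n_total
    for k1 in range(n1):
        row = dft([cols[b][k1] for b in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


# Example: a 256-point transform built from 16-point transforms (N1 = N2 = 16).
samples = [complex(n % 7, -(n % 5)) for n in range(256)]
result = two_factor_dft(samples, 16, 16)
```

With N1 = N2 = 16, every small transform in both stages is a 16-point DFT, which is why a single FFT-16 building block is sufficient for this point size.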

An improved asynchronous design flow has been a goal of the growing asynchronous
design community [3]. This field of research is wide open for new design concepts,
implementations and tools. One facet of the design process that could be improved
with the tools available in the AFIT VLSI lab is the interface with the 3D tool. A
simple interface program could be developed to convert behavioral VHDL written in a
pre-defined format to the input file format of the 3D asynchronous tool. Another
translator program could then convert the output file of the 3D tool (which is in
positive logic) into a VHDL format. The Synopsys Design Analyzer or a similar tool
could then convert the positive logic VHDL into negative logic structural VHDL and
correct for fanout. This design process was repeated manually throughout the work of
this thesis. Such a set of interface programs would have been of great benefit and
would have allowed rapid prototyping of new designs.
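
A translator of the second kind could be quite small. The following sketch converts one sum-of-products equation into a VHDL concurrent signal assignment. It assumes the equation syntax printed in Appendix C (juxtaposition for AND, "+" for OR, a trailing apostrophe for complement); the actual 3D output file format may differ, and the function names are hypothetical. Conversion to negative logic and fanout correction are not attempted.

```python
def literal_to_vhdl(literal):
    """Map a complemented literal such as lack0' to "(not lack0)"."""
    if literal.endswith("'"):
        return f"(not {literal[:-1]})"
    return literal


def equation_to_vhdl(equation):
    """Translate "name = term + term ..." into a VHDL signal assignment."""
    target, expression = (part.strip() for part in equation.split("=", 1))
    products = []
    for term in expression.split("+"):
        # Literals inside a product term are separated by whitespace.
        literals = [literal_to_vhdl(token) for token in term.split()]
        products.append("(" + " and ".join(literals) + ")")
    return f"{target} <= " + " or ".join(products) + ";"


# Example using a state-variable equation in the style of Appendix C.
print(equation_to_vhdl("zzz01 = lack0 + lack2 + lack1' lack3' zzz01"))
```

Because state variables such as zzz01 appear on both sides of their own equations, the generated assignments would need those names declared as internal signals in the surrounding architecture.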
Appendix  A.  Layout  of  the  Fabricated  FFT-4  Design 


Appendix  B.  FFT-4 IC  Specification  Sheet 


Table B-1. FFT-4 Test Chip Specifications

Package Type:               PGA65
Supply Voltage:             3.3 VDC
Average Power (Simulated):  41.5 mW
Core Transistor Count:      21,951
Total Chip Area:            2934 x 2934 µm
Data Input Sequence:        A: Re{x(0)}, B: Re{x(2)}, A: Re{x(1)}, B: Re{x(3)},
                            A: Im{x(1)}, B: Im{x(3)}, A: Im{x(0)}, B: Im{x(2)}
Data Output Sequence:       Re{X(0)}, Re{X(2)}, Re{X(1)}, Re{X(3)},
                            Im{X(1)}*, Im{X(3)}*, Im{X(0)}*, Im{X(2)}*

*Note: Im values should be multiplied by (-1) to get the correct value.

Testing Procedure:

1. Refer to Table B-2 for pin names.

2. Apply power, control and data connections to the chip.

3. Apply the RESET signal with REQ_INA=0, REQ_INB=0, ACK_OUT=0, BIDIR_SEL=1,
TEST_SEL=00 and the DATA_IN values set to zero until ACK_INA, ACK_INB and
REQ_OUT all stabilize to zero.

4. Input data with alternating REQ_INA and REQ_INB signals, as shown in Table B-1.
Leave the REQ_IN signal high and keep the data stable until the appropriate ACK_IN
signal is asserted, then lower REQ_IN. Do not raise REQ_IN again until ACK_IN goes
low.

5. When REQ_OUT goes high, read the data, then pulse ACK_OUT.

6. Use BIDIR_SEL=0 and the desired TEST_SEL value to view test signals at any time
during testing, except when REQ_INA or REQ_INB is high. Refer to Table B-3 for
viewable test signals.
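
Step 4 of the procedure describes a four-phase request/acknowledge handshake on two alternating input ports. The sketch below models that sequence in software as an aid to writing a test bench. The ToyPort class is a hypothetical stand-in that acknowledges immediately; it is not a model of the fabricated chip, and real test equipment would have to wait for each acknowledge transition.

```python
# Input order from Table B-1: (port, component, sample index).
INPUT_ORDER = [
    ("A", "re", 0), ("B", "re", 2), ("A", "re", 1), ("B", "re", 3),
    ("A", "im", 1), ("B", "im", 3), ("A", "im", 0), ("B", "im", 2),
]


class ToyPort:
    """Hypothetical input port: latches data on request, then acknowledges."""

    def __init__(self):
        self.ack = 0
        self.latched = []

    def drive(self, req, data):
        if req and not self.ack:
            self.latched.append(data)  # data is stable while REQ_IN is high
            self.ack = 1
        elif not req and self.ack:
            self.ack = 0               # return to zero completes the cycle
        return self.ack


def send_fft4_inputs(samples, ports):
    """Send four complex samples using the four-phase handshake of step 4."""
    events = []
    for name, part, index in INPUT_ORDER:
        port = ports[name]
        value = samples[index].real if part == "re" else samples[index].imag
        if port.ack != 0:
            raise RuntimeError("REQ_IN must not be raised while ACK_IN is high")
        if port.drive(1, value) != 1:      # raise REQ_IN, hold the data stable
            raise RuntimeError("no acknowledge received")
        events += [f"REQ_IN{name}+", f"ACK_IN{name}+"]
        if port.drive(0, value) != 0:      # lower REQ_IN, wait for ACK_IN low
            raise RuntimeError("acknowledge did not return to zero")
        events += [f"REQ_IN{name}-", f"ACK_IN{name}-"]
    return events


ports = {"A": ToyPort(), "B": ToyPort()}
trace = send_fft4_inputs([1 + 5j, 2 + 6j, 3 + 7j, 4 + 8j], ports)
```

Each data word therefore produces four events (request high, acknowledge high, request low, acknowledge low), and the two ports alternate so that one port can complete its return-to-zero phase while the other is being driven.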
Table B-2. Fabricated FFT-4 Pin List

PIN  SIGNAL IN                                    DATA OUT
1    PAD VDD                                      -
2    -                                            ACK_INB
3    -                                            DATA_OUT0
4    -                                            DATA_OUT2
5    -                                            DATA_OUT1
6    -                                            PAD_TEST_OUT
7    PAD_TEST_IN                                  -
8    CORE GND                                     -
9    CORE VDD                                     -
10   DATA_IN5                                     TEST_OUT2
11   DATA_IN4                                     TEST_OUT1
12   DATA_IN3                                     TEST_OUT5
13   DATA_IN2                                     TEST_OUT6
14   DATA_IN1                                     TEST_OUT8
15   DATA_IN0                                     TEST_OUT7
16   PAD GND                                      -
17   PAD GND                                      -
18   ACK_OUT                                      -
19   REQ_INB                                      -
20   TEST_SEL1                                    -
21   CORE VDD                                     -
22   CORE GND                                     -
23   TEST_SEL0                                    -
24   RESET                                        -
25   BIDIR_SEL (1 = normal, 0 = view test signals)  -
26   REQ_INA                                      -
27   CORE VDD                                     -
28   CORE GND                                     -
29   DATA_IN6                                     TEST_OUT4
30   DATA_IN7                                     TEST_OUT0
31   DATA_IN14                                    TEST_OUT3
32   PAD VDD                                      -
33   PAD VDD                                      -
34   DATA_IN15                                    TEST_OUT9
35   DATA_IN12                                    TEST_OUT10
36   DATA_IN13                                    TEST_OUT11
37   DATA_IN8                                     TEST_OUT14
38   DATA_IN11                                    TEST_OUT15
39   DATA_IN10                                    TEST_OUT13
40   PAD VDD                                      -
41   PAD GND                                      -
42   DATA_IN9                                     TEST_OUT12
43   -                                            DATA_OUT10
44   -                                            DATA_OUT9
45   -                                            DATA_OUT11
46   -                                            DATA_OUT8
47   -                                            DATA_OUT12
48   PAD GND                                      -
49   PAD GND                                      -
50   -                                            DATA_OUT13
51   -                                            DATA_OUT14
52   -                                            DATA_OUT15
53   CORE GND                                     -
54   CORE VDD                                     -
55   -                                            ACK_INA
56   -                                            DATA_OUT7
57   -                                            DATA_OUT6
58   -                                            DATA_OUT5
59   CORE GND                                     -
60   CORE VDD                                     -
61   -                                            DATA_OUT4
62   -                                            REQ_OUT
63   -                                            DATA_OUT3
64   PAD VDD                                      -


Table B-3. Test Signals

Test Pin     TEST_SEL  Internal Test Signal
TEST_OUT0    00        AREQ0
             01        AREQ2
             10        LREQ0
             11        REQOUTOP
TEST_OUT1    00        AACK0
             01        AREQ3
             10        LACK0
             11        ACKREQOUTOP
TEST_OUT2    00        AREQ1
             01        AREQ7
             10        LREQ1
             11        MUXSTATE
TEST_OUT3    00        AACK1
             01        AACK2
             10        LACK1
             11        MREQ
TEST_OUT4    00        AREQ4
             01        AACK3
             10        LREQ2
             11        MACK
TEST_OUT5    00        AACK4
             01        AACK7
             10        LACK2
             11        AREQSTATE
TEST_OUT6    00        ADD_SUB0
             01        ADD_SUB2
             10        LREQ3
             11        AACKSTATE
TEST_OUT7    00        ADD_SUB1
             01        ADD_SUB3
             10        LACK3
             11        SEL0
TEST_OUT8    00        ADD_SUB4
             01        ADD_SUB7
             10        REQREG
             11        SEL1
TEST_OUT9    00        BUS0A15
             01        BUS2A15
             10        ACKREG
             11        ADDSTATE
TEST_OUT10   00        BUS0B15
             01        BUS2B15
             10        GETREG
             11        AREQ4P
TEST_OUT11   00        BUS1A15
             01        BUS3A15
             10        AREQ5
             11        AREQ5P
TEST_OUT12   00        BUS1B15
             01        BUS3B15
             10        AACK5
             11        AREQ6P
TEST_OUT13   00        AC15
             01        FH15
             10        AREQ6
             11        AREQ7P
TEST_OUT14   00        EG15
             01        BD15
             10        AACK6
             11        RESETCIN
TEST_OUT15   00        BUS415
             01        BUS715
             10        ADD_SUB5
             11        ADD_SUB6

Figure  B-1.  FFT-4  Test  Chip  HSPICE  Simulation  of  Timing 


Figure  B-2.  FFT-4  Test  Chip  Power 
Appendix  C.  AFSM  Descriptions 


Table C-1. FFT-4 CONTROLIN AFSM

input   reqin       0
input   reqdata     0
input   lack        0
output  ackin       0
output  lreq        0
output  reqdataout  0

;;; Current  Next   Input Burst      | Output Burst
;;; State    State
    0        1      reqin+ reqdata+  | lreq+
    1        2      lack+            | ackin+
    2        3      reqin-           | ackin-
    3        4      reqin+           | lreq-
    4        5      lack-            | ackin+
    5        6      reqin-           | ackin-
    6        7      reqin+           | lreq+
    7        8      lack+            | ackin+
    8        9      reqin-           | ackin- reqdataout+
    9        10     reqin+ reqdata-  | lreq-
    10       11     lack-            | ackin+
    11       0      reqin-           | ackin- reqdataout-

;;; 3D Synthesized Equations:
;;;
;;; ackin =
;;;     reqin lack' reqdataout +
;;;     reqin lack zzz01 +
;;;     reqin lack reqdataout' zzz00 +
;;;     reqin lack' zzz00' zzz01'
;;; lreq =
;;;     reqin' lreq +
;;;     ackin lreq +
;;;     reqdata reqdataout +
;;;     reqin reqdataout' zzz00 +
;;;     reqin reqdata zzz01
;;; reqdataout =
;;;     reqin reqdataout +
;;;     lack reqdataout +
;;;     reqin' lreq zzz00
;;; zzz00 =
;;;     reqin zzz00 +
;;;     reqdata zzz00 +
;;;     lreq zzz00 +
;;;     reqin' reqdata lreq' zzz01'
;;; zzz01 =
;;;     reqin zzz01 +
;;;     lack' zzz01 +
;;;     reqin' reqdata' lreq'
Table C-2. FFT-4 CONTROLGETREG AFSM Description

input   getreg  0
input   lack0   0
input   lack1   0
input   lack2   0
input   lack3   0
output  ackreg  0
output  lreq0   0
output  lreq1   0
output  lreq2   0
output  lreq3   0

;;; Current  Next   Input Burst  | Output Burst
;;; State    State
    0        1      getreg+      | lreq0+
    1        2      lack0+       | lreq0-
    2        3      lack0-       | ackreg+
    3        4      getreg-      | lreq1+
    4        5      lack1+       | lreq1-
    5        6      lack1-       | ackreg-
    6        7      getreg+      | lreq2+
    7        8      lack2+       | lreq2-
    8        9      lack2-       | ackreg+
    9        10     getreg-      | lreq3+
    10       11     lack3+       | lreq3-
    11       0      lack3-       | ackreg-

;;; 3D Synthesized Equations:
;;;
;;; ackreg =
;;;     lack1 +
;;;     lack3 +
;;;     lack0' lack2' zzz01
;;; lreq0 =
;;;     getreg lack0' zzz00' zzz01'
;;; lreq1 =
;;;     getreg' lack1' lack3' zzz00' zzz01
;;; lreq2 =
;;;     getreg lack2' zzz00 zzz01'
;;; lreq3 =
;;;     getreg' lack1' lack3' zzz00 zzz01
;;; zzz00 =
;;;     lack1 +
;;;     lack3' zzz00
;;; zzz01 =
;;;     lack0 +
;;;     lack2 +
;;;     lack1' lack3' zzz01
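
The synthesized equations of Table C-2 can be checked against the burst-mode specification with a small zero-delay model. The sketch below applies each input burst in order, re-evaluates the equations until they settle, and compares the outputs with the specified output bursts. It is an illustrative check only: the equations are transcribed as read from the table, the helper names are hypothetical, and a zero-delay fixed point says nothing about hazards or the timing assumptions of burst-mode operation.

```python
# (input burst, output burst) pairs in state order, from Table C-2.
SPEC = [
    ("getreg+", ["lreq0+"]), ("lack0+", ["lreq0-"]), ("lack0-", ["ackreg+"]),
    ("getreg-", ["lreq1+"]), ("lack1+", ["lreq1-"]), ("lack1-", ["ackreg-"]),
    ("getreg+", ["lreq2+"]), ("lack2+", ["lreq2-"]), ("lack2-", ["ackreg+"]),
    ("getreg-", ["lreq3+"]), ("lack3+", ["lreq3-"]), ("lack3-", ["ackreg-"]),
]
INPUTS = ["getreg", "lack0", "lack1", "lack2", "lack3"]
OUTPUTS = ["ackreg", "lreq0", "lreq1", "lreq2", "lreq3"]
STATE_VARS = ["zzz00", "zzz01"]


def equations(s):
    """Evaluate the sum-of-products equations on 0/1 signal values."""
    g, l0, l1, l2, l3 = (s[name] for name in INPUTS)
    z0, z1 = s["zzz00"], s["zzz01"]
    inv = lambda v: 1 - v
    return {
        "ackreg": l1 | l3 | (inv(l0) & inv(l2) & z1),
        "lreq0": g & inv(l0) & inv(z0) & inv(z1),
        "lreq1": inv(g) & inv(l1) & inv(l3) & inv(z0) & z1,
        "lreq2": g & inv(l2) & z0 & inv(z1),
        "lreq3": inv(g) & inv(l1) & inv(l3) & z0 & z1,
        "zzz00": l1 | (inv(l3) & z0),
        "zzz01": l0 | l2 | (inv(l1) & inv(l3) & z1),
    }


def settle(state):
    """Re-evaluate all equations until no signal changes."""
    for _ in range(10):
        new_state = dict(state, **equations(state))
        if new_state == state:
            return state
        state = new_state
    raise RuntimeError("equations did not settle")


def edge_value(edge):
    """Return (signal name, new value) for an edge such as 'lack0+'."""
    return edge[:-1], 1 if edge.endswith("+") else 0


def equations_match_spec():
    state = dict.fromkeys(INPUTS + OUTPUTS + STATE_VARS, 0)
    expected = dict.fromkeys(OUTPUTS, 0)
    for input_edge, output_edges in SPEC:
        name, value = edge_value(input_edge)
        state = settle(dict(state, **{name: value}))
        for edge in output_edges:
            out_name, out_value = edge_value(edge)
            expected[out_name] = out_value
        if {k: state[k] for k in OUTPUTS} != expected:
            return False
    # After the last burst the machine should be back in its reset condition.
    return all(v == 0 for v in state.values())


print(equations_match_spec())
```

The same structure could be reused for the other controllers in this appendix by replacing SPEC, the signal lists and the body of equations().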
Table C-3. FFT-4 CONTROLOUTA AFSM Description

input   reqdataout  0
input   done        0
input   reset       0
output  areq03      0
output  areq12      0
output  reqdata     0
output  sel         0

; Current  Next   Input Burst  | Output Burst
; State    State
  0        1      reset+       | reqdata+
  1        2      reqdataout+  | areq03+ reqdata-
  2        3      reqdataout-  | areq12+
  3        4      done+        | areq03- areq12- sel+
  4        5      done-        | areq03+ areq12+
  5        6      done+        | areq03- areq12- sel-
  6        1      done-        | reqdata+

; 3D Synthesized Equations:
;
; areq03 =
;     reqdataout +
;     done' areq03 +
;     done' sel
; areq12 =
;     done' sel +
;     reqdataout' done' areq03
; reqdata =
;     reqdataout' done' reset areq03' sel'
; sel =
;     done' sel +
;     done zzz00' +
;     sel zzz00'
; zzz00 =
;     done' sel +
;     done zzz00 +
;     sel zzz00
Table C-4. FFT-4 CONTROLOUTB AFSM Description

input   aacks1    0
input   aacks2    0
input   ackout    0
output  areqs2    0
output  reqout    0
output  addsubs2  0
output  done      0

;;; Current  Next   Input Burst  | Output Burst
;;; State    State
    0        1      aacks1+      | areqs2+
    1        2      aacks2+      | reqout+
    2        3      ackout+      | reqout-
    3        4      ackout-      | areqs2- addsubs2+
    4        5      aacks2-      | areqs2+
    5        6      aacks2+      | reqout+
    6        7      ackout+      | reqout-
    7        8      ackout-      | areqs2- addsubs2-
    8        9      aacks2-      | done+
    9        0      aacks1-      | done-

;;; 3D Synthesized Equations:
;;;
;;; areqs2 =
;;;     ackout +
;;;     aacks2' addsubs2 +
;;;     aacks1 zzz00
;;; reqout =
;;;     aacks2 ackout' zzz00
;;; addsubs2 =
;;;     ackout addsubs2 +
;;;     addsubs2 zzz01' +
;;;     aacks2 ackout' zzz00' zzz01'
;;; done =
;;;     aacks1 aacks2' addsubs2' zzz00'
;;; zzz00 =
;;;     aacks1' +
;;;     aacks2' addsubs2 +
;;;     ackout' zzz00
;;; zzz01 =
;;;     ackout addsubs2 +
;;;     aacks2 zzz01
Table C-5. Booth Multiplier CONTROLMULT AFSM Description

input   multreq   0
input   lackA     0
input   boothack  0
output  multack   0
output  lreqA     0
output  boothreq  0
output  zero      0

Current  Next   Input Burst  | Output Burst
State    State
                multreq+     | lreqA+ zero+
                lackA+       | lreqA- zero-
                lackA-       | boothreq+
                boothack+    | boothreq-
                boothack-    | multack+
                multreq-     | multack-

3D Synthesized Equations:

multack =
    multreq boothack' zzz00
lreqA =
    multreq lackA' boothack' zzz00' zzz01'
boothreq =
    lackA' boothack' zzz01
zero =
    multreq lackA' boothack' zzz00' zzz01'
zzz00 =
    boothack +
    multreq zzz00
zzz01 =
    lackA +
    boothack' zzz01


Table C-6. Booth Multiplier CONTROLCALC AFSM Description

input   calcreq  0
input   alulack  0
input   aack     0
input   lackB    0
input   sack     0
output  alulreq  0
output  areq     0
output  lreqB    0
output  sreq     0
output  calcack  0

;;; Current  Next   Input Burst  | Output Burst
;;; State    State
    0        1      calcreq+     | alulreq+
    1        2      alulack+     | alulreq-
    2        3      alulack-     | areq+
    3        4      aack+        | lreqB+
    4        5      lackB+       | lreqB-
    5        6      lackB-       | areq-
    6        7      aack-        | sreq+
    7        8      sack+        | sreq-
    8        9      sack-        | calcack+
    9        0      calcreq-     | calcack-

;;; 3D Synthesized Equations:
;;;
;;; alulreq =
;;;     calcreq alulack areq' zzz00' zzz01'
;;; areq =
;;;     lackB +
;;;     calcreq alulack zzz00' zzz01
;;; lreqB =
;;;     aack lackB' zzz01
;;; sreq =
;;;     calcreq aack' sack' zzz00 zzz01'
;;; calcack =
;;;     calcreq sack' areq' zzz00 zzz01
;;; zzz00 =
;;;     lackB +
;;;     calcreq zzz00
;;; zzz01 =
;;;     alulack +
;;;     sack +
;;;     calcreq lackB' zzz01
Table C-7. Booth Multiplier CONTROLBOOTH AFSM Description

input   boothreq  0
input   calcack   0
output  boothack  0
output  calcreq   0

Input Burst

    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    calcack+
    calcack-
    boothreq-

Output Burst

    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | calcreq+
    | calcreq-
    | boothack+
    | boothack-

;;; 3D Synthesized Equations:
;;;
;;; boothack =
;;;     boothreq calcack' zzz12 zzz13
;;; calcreq =
;;;     calcack' zzz13' +
;;;     boothreq calcack' zzz12'
;;; zzz00 =
;;;     boothreq zzz00 +
;;;     calcack' zzz13'
;;; zzz01 =
;;;     calcack zzz00 +
;;;     calcack' zzz01 +
;;;     zzz00 zzz01
;;; zzz02 =
;;;     boothreq zzz02 +
;;;     calcack' zzz01 zzz13'
;;; zzz03 =
;;;     calcack zzz02 +
;;;     calcack' zzz03 +
;;;     zzz02 zzz03
;;; zzz04 =
;;;     boothreq zzz04 +
;;;     calcack' zzz03 zzz13'
;;; zzz05 =
;;;     calcack zzz04 +
;;;     calcack' zzz05 +
;;;     zzz04 zzz05
;;; zzz06 =
;;;     boothreq zzz06 +
;;;     calcack' zzz05 zzz13'
;;; zzz07 =
;;;     calcack zzz06 +
;;;     calcack' zzz07 +
;;;     zzz06 zzz07
;;; zzz08 =
;;;     boothreq zzz08 +
;;;     calcack' zzz07 zzz13'
;;; zzz09 =
;;;     calcack zzz08 +
;;;     calcack' zzz09 +
;;;     zzz08 zzz09
;;; zzz10 =
;;;     boothreq zzz10 +
;;;     calcack' zzz09 zzz13'
;;; zzz11 =
;;;     calcack zzz10 +
;;;     calcack' zzz11 +
;;;     zzz10 zzz11
;;; zzz12 =
;;;     boothreq zzz12 +
;;;     calcack' zzz11 zzz13'
;;; zzz13 =
;;;     calcack zzz12 +
;;;     calcack' zzz13 +
;;;     zzz12 zzz13
Appendix D. Simulation Results

Figure D-1. VHDL Simulation of Decimator Timing

Figure D-2. IRSIM Simulation of Decimator Timing

Figure D-3. HSPICE Simulation of Decimator Timing

Figure D-6. IRSIM Simulation of Crossbar Timing

Figure D-7. HSPICE Simulation of Crossbar Timing

Figure D-9. VHDL Simulation of Expander Timing

Figure D-10. IRSIM Simulation of Expander Timing

Figure D-11. HSPICE Simulation of Expander Timing

Figure D-12. HSPICE Simulation of Expander Power

Figure D-13. VHDL Simulation of Reorder Register Timing

Figure D-14. IRSIM Simulation of Reorder Register Timing

Figure D-15. HSPICE Simulation of Reorder Register Timing

Figure D-18. IRSIM Simulation of FFT-4 Timing

Figure D-19. HSPICE Simulation of FFT-4 Timing

Figure D-20. HSPICE Simulation of FFT-4 Power

Figure D-21. VHDL Simulation of Complex Multiplier Timing

Figure D-23. HSPICE Simulation of Complex Multiplier Timing

Figure D-24. HSPICE Simulation of Complex Multiplier Power

Figure D-25. VHDL Simulation of FFT-16 Timing

Figure D-26. IRSIM Simulation of FFT-16 Timing

Figure D-27. FFT-16 VHDL Simulation of Impulse Response

Figure D-28. FFT-16 IRSIM Simulation of Impulse Response